100+ datasets found
  1. CCDV Arxiv Summarization Dataset

    • kaggle.com
    zip
    Updated Dec 5, 2023
    Cite
    The Devastator (2023). CCDV Arxiv Summarization Dataset [Dataset]. https://www.kaggle.com/datasets/thedevastator/ccdv-arxiv-summarization-dataset
    Explore at:
    zip (2219742528 bytes)
    Dataset updated
    Dec 5, 2023
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    CCDV Arxiv Summarization Dataset

    Arxiv Summarization Dataset for CCDV

    By ccdv (From Huggingface) [source]

    About this dataset

    The validation.csv file contains a set of articles along with their respective abstracts that can be used for validating the performance of summarization models. This subset allows researchers to fine-tune their models and measure how well they can summarize scientific texts.

    The train.csv file serves as the primary training data for building summarization models. It consists of numerous articles extracted from the Arxiv database, paired with their corresponding abstracts. By utilizing this file, researchers can develop and train various machine learning algorithms to generate accurate summaries of scientific papers.

    Lastly, the test.csv file provides a separate set of articles with accompanying abstracts specifically intended for evaluating the performance and effectiveness of summarization models developed using this dataset. Researchers can utilize this test set to conduct rigorous evaluations and benchmark different approaches in automatic document summarization.

    Each file has two columns, article and abstract. This consistent structure makes the dataset straightforward to use when developing models that summarize complex scientific documents.
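    As a minimal sketch of reading one of these files, the snippet below assumes each CSV has a header row with the two columns article and abstract, as the description states. The inline sample rows are made up for illustration; in practice the same function would be given an open handle to train.csv, validation.csv, or test.csv.

```python
import csv
import io

# Made-up sample standing in for train.csv (assumed header: article,abstract).
SAMPLE_CSV = (
    "article,abstract\n"
    '"We study graph kernels. Experiments follow.","A study of graph kernels."\n'
    '"Dark matter halos are simulated, then analysed.","Simulated halos."\n'
)


def load_pairs(handle):
    """Return a list of (article, abstract) tuples from an open CSV file."""
    reader = csv.DictReader(handle)
    return [(row["article"], row["abstract"]) for row in reader]


pairs = load_pairs(io.StringIO(SAMPLE_CSV))
for article, abstract in pairs:
    # Compare lengths in whitespace-separated tokens.
    print(len(article.split()), "->", len(abstract.split()))
```

    Full-text articles can be long. If the csv module reports a field-size error on the real files, csv.field_size_limit can be used to raise the cap.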

    How to use the dataset

    • File Description:

    • validation.csv: This file contains articles and their respective abstracts that can be used for validation purposes.

    • train.csv: The purpose of this file is to provide training data for summarizing scientific articles.

    • test.csv: This file includes a set of articles and their corresponding abstracts that can be used to evaluate the performance of summarization models.

    • Dataset Structure: Each file consists of two columns, article and abstract.

    • Usage Examples: This dataset can be utilized in various ways:

    a) Training Models: You can use the train.csv file to train your own model for summarizing scientific articles from the Arxiv database. The article column provides the full text of each scientific paper, while the abstract column contains its summary.

    b) Validation: The validation.csv file allows you to validate your trained models by comparing their generated summaries with the provided reference summaries in order to assess their performance.

    c) Evaluation: Utilize the test.csv file as a benchmark for evaluating different summarization models. Generate summaries using your selected model and compare them with reference summaries.

    • Evaluating Performance: To measure how well your summarization model performs on this dataset, you can use metrics like ROUGE (Recall-Oriented Understudy for Gisting Evaluation). ROUGE measures overlap between generated summaries and reference summaries based on n-gram co-occurrence statistics.
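    The n-gram overlap idea behind ROUGE can be sketched in a few lines. This is a simplified illustration, not a reference implementation: it assumes lowercase whitespace tokenization with no stemming, so its scores may differ from those of published ROUGE packages. The function names and example sentences are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """Return (precision, recall, f1) from clipped n-gram overlap."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # per-n-gram minimum count
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1


reference = "the model summarizes scientific papers"
candidate = "the model summarizes papers well"
print(rouge_n(candidate, reference, n=1))  # unigram overlap (ROUGE-1 style)
print(rouge_n(candidate, reference, n=2))  # bigram overlap (ROUGE-2 style)
```

    Recall is the share of reference n-grams that the generated summary recovers, and precision is the share of generated n-grams that appear in the reference.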


    Research Ideas

    • Summarizing scientific articles: This dataset can be used to train and evaluate summarization models for the task of generating concise summaries of scientific articles from the Arxiv database. Researchers can utilize this dataset to develop novel techniques and approaches for automatic summarization in the scientific domain.
    • Information retrieval: The dataset can be used to enhance search engines or information retrieval systems by providing concise summaries along with the full text of scientific articles. This would enable users to quickly grasp key information without having to read the entire article, improving accessibility and efficiency.
    • Text generation research: Researchers interested in natural language processing and text generation can use this dataset as a benchmark for developing new models and algorithms that generate coherent, informative, and concise summaries from lengthy scientific texts. The dataset provides a diverse range of articles across various domains, allowing researchers to explore different challenges in summary generation

    Acknowledgements

    If you use this dataset in your research, please credit the original authors.

    License

    **License: [CC0 1.0 Universal (CC0 1.0) - Public Domain...

  2. Combined dataset from Arxiv and Wikipedia

    • kaggle.com
    Updated Feb 1, 2025
    Cite
    Monica Avagyan (2025). Combined dataset from Arxiv and Wikipedia [Dataset]. https://www.kaggle.com/datasets/monikaavagyan/combined-dataset-from-arxiv-and-wikipedia
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Feb 1, 2025
    Dataset provided by
    Kaggle
    Authors
    Monica Avagyan
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    This dataset is a combination of arXiv and Wikipedia data, containing metadata and textual content related to academic publications and encyclopedic knowledge. It includes key attributes such as authors, titles, digital object identifiers (DOI), categories, abstracts, update dates, URLs, and full text content of documents.

    Key Features

    • Authors: Names of the researchers or contributors to each document.
    • Title: The title of the publication or article.
    • DOI: A unique identifier for academic papers.
    • Categories: Classification labels indicating the subject area (e.g., astrophysics, mathematics, physics).
    • Abstract: A summary of the document’s content.
    • Update Date: The most recent modification date of the document.
    • URL: A link to the full document.
    • Text: The full textual content of the document.

    The dataset contains over 2.6 million unique values for text-based fields.

    Use Cases

    • Academic research analysis: Studying trends in scientific publications over time.
    • Natural Language Processing (NLP): Developing models for summarization, classification, and text generation.
    • Knowledge extraction: Identifying key themes and topics in scientific and encyclopedic data.
    • Citation and impact studies: Analyzing author influence and research impact based on citations.

    This dataset is a valuable resource for text mining, AI training, and scientific knowledge analysis, providing a rich blend of structured metadata and unstructured text.

  3. unarXive: All arXiv Publications Pre-Processed for NLP, Including Structured...

    • data.niaid.nih.gov
    • nde-dev.biothings.io
    Updated Nov 3, 2023
    + more versions
    Cite
    Saier, Tarek; Krause, Johan; Färber, Michael (2023). unarXive: All arXiv Publications Pre-Processed for NLP, Including Structured Full-Text and Citation Network (open subset) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7752614
    Explore at:
    Dataset updated
    Nov 3, 2023
    Dataset provided by
    Karlsruhe Institute of Technology
    Authors
    Saier, Tarek; Krause, Johan; Färber, Michael
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    unarXive is a scholarly data set containing publications' structured full-text, annotated in-text citations, linked non-text content (mathematical notation, figure/table captions) and a citation network. The data is generated from all LaTeX sources on arXiv and is therefore of higher quality than data generated from PDF files. Typical uses are:

    • Training of ML models (citation recommendation, summarization, LLMs)
    • Citation context analysis
    • Bibliographic analyses

    Access

    Regarding the full data set, please note the following:

    Note: this Zenodo record is the "open subset" of unarXive, which contains all permissively licensed papers from arXiv.org. You can find the full version here. The code used for generating the data set is publicly available.

  4. ArXiv CS Papers Multi-Label Classification (200K)

    • kaggle.com
    zip
    Updated Jun 7, 2023
    Cite
    Sharukh Rahman (2023). ArXiv CS Papers Multi-Label Classification (200K) [Dataset]. https://www.kaggle.com/datasets/devintheai/arxiv-cs-papers-multi-label-classification-200k-v1
    Explore at:
    zip (83841332 bytes)
    Dataset updated
    Jun 7, 2023
    Authors
    Sharukh Rahman
    Description

    The ArXiv CS Papers Multi-Label Classification dataset is a comprehensive collection of research papers from the computer science domain. This dataset is intended for multi-label classification tasks and contains a diverse range of research papers spanning various topics within computer science.

    The dataset consists of more than 200,000 research papers and includes the following columns:

    • Paper ID: A unique identifier for each research paper in the dataset.
    • Title: The title of the research paper.
    • Abstract: A brief summary or abstract of the research paper.
    • Year: The publication year of the research paper.
    • Primary Category: The primary category of the research paper, representing the main topic or area of focus.
    • Categories: Additional categories or subtopics associated with the research paper.

    This dataset is well-suited for tasks related to text classification, topic modeling, information retrieval, and other natural language processing (NLP) tasks. Researchers and practitioners can leverage this dataset to develop and evaluate machine learning models for multi-label classification on a wide range of computer science topics.
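    For multi-label classification, each paper's categories are typically turned into a 0/1 vector with one position per label. The sketch below shows that encoding under a loud assumption: that the Categories column holds a space-separated string of arXiv category codes. The listing does not state the actual storage format, and the rows and the lowercase key name categories are made up for illustration.

```python
def build_label_index(rows):
    """Map every category seen in the rows to a column position."""
    labels = sorted({cat for row in rows for cat in row["categories"].split()})
    return {label: i for i, label in enumerate(labels)}


def multi_hot(row, label_index):
    """Encode one paper's categories as a 0/1 vector."""
    vector = [0] * len(label_index)
    for cat in row["categories"].split():
        vector[label_index[cat]] = 1
    return vector


# Hypothetical rows; the space-separated category format is an assumption.
rows = [
    {"title": "Paper A", "categories": "cs.LG cs.CL"},
    {"title": "Paper B", "categories": "cs.CV"},
    {"title": "Paper C", "categories": "cs.CL cs.IR"},
]
label_index = build_label_index(rows)
vectors = [multi_hot(row, label_index) for row in rows]
for row, vector in zip(rows, vectors):
    print(row["title"], vector)
```

    Sorting the labels keeps the column order stable across runs, which matters when the same index is reused at prediction time.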

    Note: Please refer to the original ArXiv repository for access to the full-text content of the papers and proper citation guidelines. This dataset contains metadata and should be used for research and educational purposes only.

    We hope that the ArXiv CS Papers Multi-Label Classification dataset serves as a valuable resource for researchers, data scientists, and machine learning enthusiasts in their quest to advance knowledge and understanding in the field of computer science.

  5. Data from: A Toolbox for Surfacing Health Equity Harms and Biases in Large...

    • springernature.figshare.com
    application/csv
    Updated Sep 24, 2024
    Cite
    Stephen R. Pfohl; Heather Cole-Lewis; Rory Sayres; Darlene Neal; Mercy Asiedu; Awa Dieng; Nenad Tomasev; Qazi Mamunur Rashid; Shekoofeh Azizi; Negar Rostamzadeh; Liam G. McCoy; Leo Anthony Celi; Yun Liu; Mike Schaekermann; Alanna Walton; Alicia Parrish; Chirag Nagpal; Preeti Singh; Akeiylah Dewitt; Philip Mansfield; Sushant Prakash; Katherine Heller; Alan Karthikesalingam; Christopher Semturs; Joëlle K. Barral; Greg Corrado; Yossi Matias; Jamila Smith-Loud; Ivor B. Horn; Karan Singhal (2024). A Toolbox for Surfacing Health Equity Harms and Biases in Large Language Models [Dataset]. http://doi.org/10.6084/m9.figshare.26133973.v1
    Explore at:
    application/csv
    Dataset updated
    Sep 24, 2024
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Stephen R. Pfohl; Heather Cole-Lewis; Rory Sayres; Darlene Neal; Mercy Asiedu; Awa Dieng; Nenad Tomasev; Qazi Mamunur Rashid; Shekoofeh Azizi; Negar Rostamzadeh; Liam G. McCoy; Leo Anthony Celi; Yun Liu; Mike Schaekermann; Alanna Walton; Alicia Parrish; Chirag Nagpal; Preeti Singh; Akeiylah Dewitt; Philip Mansfield; Sushant Prakash; Katherine Heller; Alan Karthikesalingam; Christopher Semturs; Joëlle K. Barral; Greg Corrado; Yossi Matias; Jamila Smith-Loud; Ivor B. Horn; Karan Singhal
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Supplementary material and data for Pfohl and Cole-Lewis et al., "A Toolbox for Surfacing Health Equity Harms and Biases in Large Language Models" (2024).

    We include the sets of adversarial questions for each of the seven EquityMedQA datasets (OMAQ, EHAI, FBRT-Manual, FBRT-LLM, TRINDS, CC-Manual, and CC-LLM), the three other non-EquityMedQA datasets used in this work (HealthSearchQA, Mixed MMQA-OMAQ, and Omiye et al.), as well as the data generated as a part of the empirical study, including the generated model outputs (Med-PaLM 2 [1] primarily, with Med-PaLM [2] answers for pairwise analyses) and ratings from human annotators (physicians, health equity experts, and consumers). See the paper for details on all datasets.

    We include other datasets evaluated in this work: HealthSearchQA [2], Mixed MMQA-OMAQ, and Omiye et al [3].

    • Mixed MMQA-OMAQ is composed of the 140-question subset of MultiMedQA questions described in [1,2] with an additional 100 questions from OMAQ (described below). The 140 MultiMedQA questions are composed of 100 from HealthSearchQA, 20 from LiveQA [4], and 20 from MedicationQA [5]. In the data presented here, we do not reproduce the text of the questions from LiveQA and MedicationQA. For LiveQA, we instead use identifiers that correspond to those presented in the original dataset. For MedicationQA, we designate "MedicationQA_N" to refer to the N-th row of MedicationQA (0-indexed).
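    The "MedicationQA_N" convention can be resolved back to a row with a small helper. This is a sketch under stated assumptions: the function names are hypothetical, and the MedicationQA table is assumed to have been obtained separately and loaded as a list in its original row order. The example rows are placeholders, not real questions.

```python
import re

_MEDICATIONQA_ID = re.compile(r"^MedicationQA_(\d+)$")


def medicationqa_row(identifier):
    """Return the 0-indexed row number encoded in an identifier, or None."""
    match = _MEDICATIONQA_ID.match(identifier)
    return int(match.group(1)) if match else None


def lookup_question(identifier, medicationqa_rows):
    """Fetch the question text from a separately obtained MedicationQA table."""
    index = medicationqa_row(identifier)
    if index is None:
        raise ValueError(f"not a MedicationQA identifier: {identifier!r}")
    return medicationqa_rows[index]


# Placeholder rows standing in for the real MedicationQA questions.
placeholder_rows = ["question zero", "question one", "question two"]
print(medicationqa_row("MedicationQA_2"))
print(lookup_question("MedicationQA_2", placeholder_rows))
```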

    A limited number of data elements described in the paper are not included here. The following elements are excluded:

    1. The reference answers written by physicians to HealthSearchQA questions, introduced in [2], and the set of corresponding pairwise ratings. This accounts for 2,122 rated instances.

    2. The free-text comments written by raters during the ratings process.

    3. Demographic information associated with the consumer raters (only age group information is included).

    References

    1. Singhal, K., et al. Towards expert-level medical question answering with large language models. arXiv preprint arXiv:2305.09617 (2023).

    2. Singhal, K., Azizi, S., Tu, T. et al. Large language models encode clinical knowledge. Nature 620, 172–180 (2023). https://doi.org/10.1038/s41586-023-06291-2

    3. Omiye, J.A., Lester, J.C., Spichak, S. et al. Large language models propagate race-based medicine. npj Digit. Med. 6, 195 (2023). https://doi.org/10.1038/s41746-023-00939-z

    4. Abacha, Asma Ben, et al. "Overview of the medical question answering task at TREC 2017 LiveQA." TREC. 2017.

    5. Abacha, Asma Ben, et al. "Bridging the gap between consumers’ medication questions and trusted answers." MEDINFO 2019: Health and Wellbeing e-Networks for All. IOS Press, 2019. 25-29.

    Description of files and sheets

    1. Independent Ratings [ratings_independent.csv]: Contains ratings of the presence of bias and its dimensions in Med-PaLM 2 outputs using the independent assessment rubric for each of the datasets studied. The primary response regarding the presence of bias is encoded in the column bias_presence with three possible values (No bias, Minor bias, Severe bias). Binary assessments of the dimensions of bias are encoded in separate columns (e.g., inaccuracy_for_some_axes). Instances for the Mixed MMQA-OMAQ dataset are triple-rated for each rater group; other datasets are single-rated. Ratings were missing for five instances in MMQA-OMAQ and two instances in CC-Manual. This file contains 7,519 rated instances.

    2. Paired Ratings [ratings_pairwise.csv]: Contains comparisons of the presence or degree of bias and its dimensions in Med-PaLM and Med-PaLM 2 outputs for each of the datasets studied. Pairwise responses are encoded in terms of two binary columns corresponding to which of the answers was judged to contain a greater degree of bias (e.g., Med-PaLM-2_answer_more_bias). Dimensions of bias are encoded in the same way as for ratings_independent.csv. Instances for the Mixed MMQA-OMAQ dataset are triple-rated for each rater group; other datasets are single-rated. Four ratings were missing (one for EHAI, two for FBRT-Manual, one for FBRT-LLM). This file contains 6,446 rated instances.

    3. Counterfactual Paired Ratings [ratings_counterfactual.csv]: Contains ratings under the counterfactual rubric for pairs of questions defined in the CC-Manual and CC-LLM datasets. Contains a binary assessment of the presence of bias (bias_presence), columns for each dimension of bias, and categorical columns corresponding to other elements of the rubric (ideal_answers_diff, how_answers_diff). Instances for the CC-Manual dataset are triple-rated; instances for CC-LLM are single-rated. Due to a data processing error, we removed questions that refer to "Natal" from the analysis of the counterfactual rubric on the CC-Manual dataset. This affects three questions (corresponding to 21 pairs) derived from one seed question based on the TRINDS dataset. This file contains 1,012 rated instances.

    4. Open-ended Medical Adversarial Queries (OMAQ) [equitymedqa_omaq.csv]: Contains questions that compose the OMAQ dataset. The OMAQ dataset was first described in [1].

    5. Equity in Health AI (EHAI) [equitymedqa_ehai.csv]: Contains questions that compose the EHAI dataset.

    6. Failure-Based Red Teaming - Manual (FBRT-Manual) [equitymedqa_fbrt_manual.csv]: Contains questions that compose the FBRT-Manual dataset.

    7. Failure-Based Red Teaming - LLM (FBRT-LLM); full [equitymedqa_fbrt_llm.csv]: Contains questions that compose the extended FBRT-LLM dataset.

    8. Failure-Based Red Teaming - LLM (FBRT-LLM) [equitymedqa_fbrt_llm_661_sampled.csv]: Contains questions that compose the sampled FBRT-LLM dataset used in the empirical study.

    9. TRopical and INfectious DiseaseS (TRINDS) [equitymedqa_trinds.csv]: Contains questions that compose the TRINDS dataset.

    10. Counterfactual Context - Manual (CC-Manual) [equitymedqa_cc_manual.csv]: Contains pairs of questions that compose the CC-Manual dataset.

    11. Counterfactual Context - LLM (CC-LLM) [equitymedqa_cc_llm.csv]: Contains pairs of questions that compose the CC-LLM dataset.

    12. HealthSearchQA [other_datasets_healthsearchqa.csv]: Contains questions sampled from the HealthSearchQA dataset [1,2].

    13. Mixed MMQA-OMAQ [other_datasets_mixed_mmqa_omaq]: Contains questions that compose the Mixed MMQA-OMAQ dataset.

    14. Omiye et al. [other datasets_omiye_et_al]: Contains questions proposed in Omiye et al. [3].
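    As a rough sketch of working with the independent ratings, the snippet below tallies the bias_presence column. Only that column name and its three values (No bias, Minor bias, Severe bias) come from the file description above. The sample rows are made up, and the real ratings_independent.csv has more columns that are ignored here.

```python
import csv
import io
from collections import Counter

# Made-up rows; only the bias_presence column name comes from the description.
SAMPLE = (
    "bias_presence\n"
    "No bias\n"
    "Minor bias\n"
    "No bias\n"
    "Severe bias\n"
)


def tally_bias(handle):
    """Count each bias_presence value in an open ratings CSV."""
    return Counter(row["bias_presence"] for row in csv.DictReader(handle))


counts = tally_bias(io.StringIO(SAMPLE))
total = sum(counts.values())
biased = total - counts["No bias"]  # minor and severe combined
print(counts)
print(f"share rated as biased: {biased / total:.2f}")
```

    Because Mixed MMQA-OMAQ instances are triple-rated, a real analysis would likely need to aggregate ratings per instance before computing rates. That step is left out of this sketch.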

    Version history

    Version 2: Updated to include ratings and generated model outputs. Dataset files were updated to include unique ids associated with each question.

    Version 1: Contained datasets of questions without ratings. Consistent with v1 available as a preprint on arXiv (https://arxiv.org/abs/2403.12025).

    WARNING: These datasets contain adversarial questions designed specifically to probe biases in AI systems. They can include human-written and model-generated language and content that may be inaccurate, misleading, biased, disturbing, sensitive, or offensive.

    NOTE: the content of this research repository (i) is not intended to be a medical device; and (ii) is not intended for clinical use of any kind, including but not limited to diagnosis or prognosis.

  6. Data from: Beyond designer’s knowledge-Generating materials design...

    • figshare.com
    bin
    Updated Sep 13, 2024
    Cite
    quanliang liu (2024). Beyond designer’s knowledge-Generating materials design hypotheses via a large language model [Dataset]. http://doi.org/10.6084/m9.figshare.26322460.v4
    Explore at:
    bin
    Dataset updated
    Sep 13, 2024
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    quanliang liu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data supporting the paper entitled "Beyond designer’s knowledge: Generating materials design hypotheses via a large language model". For detailed information about the file contents, see 'Folder Contents Description.pdf'.

    Provider is willing to license its rights in the Prompts and Codes (“Provider’s Rights”) to academic researchers to use free of charge solely for academic, non-commercial research purposes subject to the terms and conditions outlined herein. The Prompts and Codes were created at the University of Wisconsin ("UW") by Quanliang Liu and Hyunseok Oh. Please note Provider's Rights may include, but are not limited to, certain patents or patent applications owned by the Wisconsin Alumni Research Foundation (“WARF”).

    https://arxiv.org/abs/2409.06756

  7. arxiv-image-text

    • huggingface.co
    Updated Sep 18, 2023
    Cite
    nopperl (2023). arxiv-image-text [Dataset]. https://huggingface.co/datasets/nopperl/arxiv-image-text
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Sep 18, 2023
    Authors
    nopperl
    License

    https://choosealicense.com/licenses/pddl/

    Description

    arXiv Figures Dataset

    This dataset contains image-text pairs extracted from figures from papers published until the end of 2020 in the arXiv repository. The dataset can be used to train CLIP models. This repo contains a Parquet file containing the metadata of a WebDataset in img2dataset format. The images themselves are not distributed and need to be retrieved. Note that the images cannot be retrieved by an HTTP URL, so img2dataset cannot be used as is to retrieve the data. Instead… See the full description on the dataset page: https://huggingface.co/datasets/nopperl/arxiv-image-text.

  8. Supplementary Data: Status of the scalar singlet dark matter model...

    • data-staging.niaid.nih.gov
    • explore.openaire.eu
    • +1more
    Updated Jan 24, 2020
    Cite
    The GAMBIT Collaboration (2020). Supplementary Data: Status of the scalar singlet dark matter model (arXiv:1705.07931) [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_801510
    Explore at:
    Dataset updated
    Jan 24, 2020
    Authors
    The GAMBIT Collaboration
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Supplementary Data

    Status of the scalar singlet dark matter model arXiv:1705.07931

    The files in this record contain data for the scalar singlet dark matter model considered in the GAMBIT "Round 1" scalar singlet paper.

    The files consist of

    Three YAML files, each corresponding to a different parameter range

    StandardModel_SLHA2_SingletDM_scan_15.yaml, a universal YAML fragment included from the other three YAML files

    Three hdf5 files. SingletDM.hdf5 contains the combined results of all sampling runs, and is the basis for the profile likelihood plots in the paper. SingletDM_TW_full.hdf5 and SingletDM_TW_lowmass.hdf5 contain the results from T-Walk scans over the full and low-mass parameter ranges, respectively. These are the bases for the marginalised posterior plots in the paper.

    An example pip file corresponding to each hdf5 file, for producing plots using pippi

    A tarball best_fits_yaml.tar.gz containing YAML files of the best-fit point in each subregion of the fit.

    The YAML files corresponding to different parameter ranges follow the naming scheme SingletDM_[slice].yaml, where slice may be full, lowmass or neck. Each of these YAML files contains entries in the Scanners node for running Diver, MultiNest, TWalk and GreAT.

    A few caveats to keep in mind:

    The YAML files that we give here are updated compared to the ones that we used when generating the hdf5 file, in order to match the set of available options in the release version of GAMBIT 1.0.0. The included physics and numerics are however identical.

    The YAML files are designed to work with the tagged release of GAMBIT 1.0.0, and the pip file is tested with pippi 2.0, commit 2ab061a8. They may or may not work with later versions of either software (but you can of course always obtain the version that they do work with via the git history).

    The pip file is an example only. Users wishing to reproduce the more advanced plots in any of the GAMBIT papers should contact us for tips or scripts, or experiment for themselves. Many of these scripts are in multiple parts and require undocumented manual interventions and steps in order to implement various plot-specific customisations, so please don't expect the same level of polish as for files provided here or in the GAMBIT repo.

  9. Data from: Extracting Accurate Materials Data from Research Papers with...

    • figshare.com
    zip
    Updated Dec 4, 2023
    Cite
    Dane Morgan; Maciej Polak (2023). Extracting Accurate Materials Data from Research Papers with Conversational Language Models and Prompt Engineering [Dataset]. http://doi.org/10.6084/m9.figshare.22213747.v5
    Explore at:
    zip
    Dataset updated
    Dec 4, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Dane Morgan; Maciej Polak
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data supporting the paper entitled "Extracting Accurate Materials Data from Research Papers with Conversational Language Models and Prompt Engineering" by Maciej P. Polak and Dane Morgan (https://arxiv.org/abs/2303.05352).

    • BulkModulus_test_database_MPPolak_DMorgan.xlsx: a dataset of bulk modulus text passages and sentences used for methods assessment.
    • CriticalCoolingRates_MGs_database_MPPolak_DMorgan.xlsx: a database of critical cooling rates of metallic glasses. The data is presented in three versions (described in detail in the paper): "raw", "cleaned", and "standardized". The critical cooling rate database additionally includes manually extracted data serving as ground truth for tests, in sheets labeled "manual". In addition, a "standardized_MG" database is included, which limits the results to metallic glasses only, together with "standardized_tables_MG" for values extracted from tables, and "Figure_Classification", which contains figure numbers, captions, and DOIs of their source documents.
    • YieldStrength_HEAs_database_MPPolak_DMorgan.xlsx: a database of yield strengths in the context of high entropy alloys. The data is presented in three versions (described in detail in the paper): "raw", "cleaned", and "standardized". In addition, a "standardized_HEA" database is included, which limits the results to HEAs only, together with "standardized_tables_HEA" for values extracted from tables, and "Figure_Classification", which contains figure numbers, captions, and DOIs of their source documents.
    • ChatExtract_Code_MPPolak_DMorgan.zip: the ChatExtract code with a short example and instructions.

  10. Data, Codes, and Supplementary Figures for "Leveraging Vision Capabilities...

    • figshare.com
    zip
    Updated Oct 21, 2025
    Cite
    Dane Morgan; Maciej P. Polak (2025). Data, Codes, and Supplementary Figures for "Leveraging Vision Capabilities of Multimodal LLMs for Automated Data Extraction from Plots" [Dataset]. http://doi.org/10.6084/m9.figshare.28559639.v2
    Explore at:
    zip
    Dataset updated
    Oct 21, 2025
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Dane Morgan; Maciej P. Polak
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data, Codes, and Supplementary Figures for "Leveraging Vision Capabilities of Multimodal LLMs for Automated Data Extraction from Plots" (https://arxiv.org/abs/2503.12326)

    This repository contains datasets and tools related to PlotExtract, a pipeline for automated plot digitization using LLM-based vision models. Below is a description of the key components:

    Dataset Output Files

    • *.out_data: Results of LLM-based visual data extraction from plot images. These files contain the extracted data points in CSV-like format.
    • *.out_code: Python code generated by the LLM to recreate the source plot using the extracted data.
    • *.out_conversation: Full conversations with the LLM conducted by PlotExtract, including prompts and responses.
    • interpolated_*: Visual and statistical comparisons based on interpolation between the LLM-extracted data and the ground truth. These correspond to the interpolation accuracy assessments described in the paper.
    • pointwise_*: Visual and statistical comparisons on a point-by-point basis between extracted and ground-truth data. These correspond to pointwise accuracy evaluations from the main text.
    • *.stats: Numerical summaries of extraction accuracy, referenced in the associated visual comparisons.
    • *.csv: Manually extracted ground-truth data used as reference for evaluating extraction accuracy.

    All of the above files are generated automatically during PlotExtract execution.

    Published, Synthetic, and chartQA Dataset

    The Published Dataset does not include original plot images due to copyright restrictions. Instead, each plot is referenced in source_images.csv, which lists:

    • DOI of the source publication
    • Figure number
    • Filename used in this dataset

    The Synthetic Dataset includes synthetic plot images, extracted data, generated replots, and evaluation outputs for benchmarking purposes.

    The chartQA Dataset (https://doi.org/10.48550/arXiv.2203.10244) includes chartQA plot images, extracted data, generated replots, and evaluation outputs for benchmarking purposes. There are two equivalent datasets, FULL and CROPPED: the first contains the original images, and the second contains images cropped as much as possible to preserve the plot only and remove additional text.

    Codes

    All source code, including PlotExtract and supporting scripts for evaluation and comparison, is included in MPPolak_DMorgan_PlotExtract_Codes.zip. Each script contains usage instructions in-line and is intended to be self-explanatory for users familiar with Python-based data processing workflows.
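    When browsing the unpacked archive, the file-naming patterns described above can be used to sort files by role. The sketch below is illustrative only: the helper name, the short descriptions, and the example file names are hypothetical, and the pattern order is a design choice so that prefixed comparison files and source_images.csv are recognised before the more general suffix patterns.

```python
from fnmatch import fnmatchcase

# Patterns paraphrased from the listing; order matters (first match wins).
PATTERNS = [
    ("source_images.csv", "index of source figures"),
    ("interpolated_*", "interpolation-based comparison"),
    ("pointwise_*", "point-by-point comparison"),
    ("*.out_data", "extracted data points"),
    ("*.out_code", "LLM-generated replot code"),
    ("*.out_conversation", "full LLM conversation"),
    ("*.stats", "accuracy summary"),
    ("*.csv", "manually extracted ground truth"),
]


def describe(filename):
    """Return the description of the first matching pattern, else None."""
    for pattern, description in PATTERNS:
        if fnmatchcase(filename, pattern):
            return description
    return None


# Hypothetical file names, for illustration only.
for name in ["fig1.out_data", "pointwise_fig1.png", "fig1.csv", "README.md"]:
    print(name, "->", describe(name))
```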

  11. Data from: Large Language Model

    • zenodo.org
    application/gzip
    Updated Jan 24, 2020
    Cite
    Gregory Diamos; Mostofa (2020). Large Language Model [Dataset]. http://doi.org/10.5281/zenodo.1492880
    Explore at:
    Available download formats: application/gzip
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Gregory Diamos; Mostofa
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    One of the large language models trained in this paper: https://arxiv.org/abs/1810.10045

  12. Summary of citation networks arXiv-HepTh and arXiv-HepPh.

    • plos.figshare.com
    xls
    Updated Jun 3, 2023
    Cite
    Yuichiro Yasui; Junji Nakano (2023). Summary of citation networks arXiv-HepTh and arXiv-HepPh. [Dataset]. http://doi.org/10.1371/journal.pone.0269845.t002
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 3, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Yuichiro Yasui; Junji Nakano
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Summary of citation networks arXiv-HepTh and arXiv-HepPh.

  13. subset_arxiv_papers_with_embeddings

    • huggingface.co
    Updated Jun 30, 2015
    Cite
    MongoDB (2015). subset_arxiv_papers_with_embeddings [Dataset]. https://huggingface.co/datasets/MongoDB/subset_arxiv_papers_with_embeddings
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 30, 2015
    Dataset authored and provided by
    MongoDB
    Description

    This dataset is a curated subset of the original arXiv dataset, each entry enriched with a 256-dimensional embedding vector. The embeddings are generated using OpenAI's "text-embedding-3-small" model. For each data point, the embedding is created by concatenating the text of the title, author(s), and abstract into a single string, which is then processed by the embedding model. This approach captures the semantic essence of each document, facilitating tasks such as similarity search… See the full description on the dataset page: https://huggingface.co/datasets/MongoDB/subset_arxiv_papers_with_embeddings.
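The description above can be sketched in a few lines of plain Python: build the text that would be embedded by concatenating title, authors, and abstract, then rank stored vectors by cosine similarity. This is an illustration under stated assumptions. The separator used when concatenating, the function names, and the toy 3-dimensional vectors are hypothetical; the real dataset stores 256-dimensional vectors, and no embedding API is called here.

```python
import math


def build_embedding_input(title, authors, abstract):
    # The separator is an assumption; the dataset description only says the
    # three fields are concatenated into a single string.
    return " ".join([title, ", ".join(authors), abstract])


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, corpus, k=3):
    """Return the ids of the k corpus vectors most similar to the query."""
    scored = [(cosine_similarity(query_vector, vec), doc_id)
              for doc_id, vec in corpus.items()]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]


# Toy vectors standing in for real embeddings.
corpus = {
    "paper_a": [1.0, 0.0, 0.0],
    "paper_b": [0.0, 1.0, 0.0],
    "paper_c": [0.9, 0.1, 0.0],
}
print(top_k([1.0, 0.0, 0.0], corpus, k=2))
```

A brute-force scan like this is fine for small subsets; larger collections would normally use a vector index instead.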

  14. 3M+ Academic Papers: Titles & Abstracts

    • kaggle.com
    zip
    Updated Sep 18, 2025
    + more versions
    Cite
    David Arias (2025). 3M+ Academic Papers: Titles & Abstracts [Dataset]. https://www.kaggle.com/datasets/beta3logic/3m-academic-papers-titles-and-abstracts
    Explore at:
    Available download formats: zip (1478156333 bytes)
    Dataset updated
    Sep 18, 2025
    Authors
    David Arias
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Comprehensive Academic Papers Dataset: 3M+ Research Paper Titles and Abstracts

    📋 Overview

    This dataset is a comprehensive collection of over 3 million research paper titles and abstracts, curated and consolidated from multiple high-quality academic sources. The dataset provides a unified, clean, and standardized format for researchers, data scientists, and machine learning practitioners working on natural language processing, academic research analysis, and knowledge discovery tasks.

    🎯 Key Features

    • 3.6+ million scientific papers with titles and abstracts
    • Multi-domain coverage: Physics, Mathematics, Computer Science, Biology, Medicine, and more
    • Standardized format: Consistent title and abstract columns
    • Quality assured: Validated using Pydantic models and cleaned of duplicates/null values
    • Ready-to-use: Pre-processed and formatted for immediate analysis
    • Format: CSV
    • Language: English

    📊 Dataset Statistics

    Metric            Value
    Total Records     ~3,000,000+
    Columns           2 (title, abstract)
    File Size         4.15 GB
    Format            CSV
    Duplicates        Removed
    Missing Values    Removed

    🗂️ Dataset Structure

    cleaned_papers.csv
    ├── title (string): Scientific paper title
    └── abstract (string): Scientific paper abstract
    

    🔄 Data Processing Pipeline

    The dataset underwent a rigorous cleaning and standardization process:

    1. Data Import: Automated import from multiple sources (Kaggle API, Hugging Face)
    2. Column Standardization: Mapping various column names to consistent title and abstract format
    3. Data Validation: Pydantic model validation ensuring data quality
    4. Duplicate Removal: Advanced deduplication based on title and abstract similarity
    5. Null Value Handling: Removal of records with missing titles or abstracts
    6. Quality Assurance: Final validation and statistics generation
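The column standardization, null handling, and deduplication steps above can be sketched with the standard library alone. This is not the author's pipeline: the column aliases are hypothetical, Pydantic validation is replaced by simple checks, and deduplication here is exact matching on case-folded, whitespace-normalized text rather than the similarity-based approach the description mentions.

```python
# Hypothetical aliases; the real mapping used for the dataset is not given.
TITLE_ALIASES = ("title", "Title", "paper_title")
ABSTRACT_ALIASES = ("abstract", "Abstract", "summary")


def standardize(record):
    """Map a raw record (dict) to {'title', 'abstract'}.

    Returns None when either field is missing or empty.
    """
    title = next((record[k] for k in TITLE_ALIASES if record.get(k)), None)
    abstract = next((record[k] for k in ABSTRACT_ALIASES if record.get(k)), None)
    if not title or not abstract:
        return None
    # Collapse runs of whitespace so trivially different copies compare equal.
    title = " ".join(str(title).split())
    abstract = " ".join(str(abstract).split())
    if not title or not abstract:
        return None
    return {"title": title, "abstract": abstract}


def clean(records):
    """Standardize records, drop incomplete ones, and remove exact duplicates."""
    seen = set()
    cleaned = []
    for raw in records:
        row = standardize(raw)
        if row is None:
            continue
        key = (row["title"].casefold(), row["abstract"].casefold())
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(row)
    return cleaned


raw_records = [
    {"Title": "A  Paper", "Abstract": "Some text."},
    {"title": "a paper", "abstract": "some text."},   # duplicate after normalization
    {"title": "No abstract", "abstract": None},        # dropped: missing abstract
    {"paper_title": "B", "summary": "Other"},
]
print(clean(raw_records))
```

Keeping the first occurrence of each duplicate preserves the original casing of whichever source was imported first.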

    💡 Use Cases

    This dataset is ideal for:

    • Natural Language Processing: Text classification, sentiment analysis, topic modeling
    • Scientific Literature Analysis: Trend analysis, domain classification, citation prediction
    • Machine Learning Research: Training language models, text summarization, information extraction
    • Academic Research: Bibliometric analysis, research trend identification
    • Educational Applications: Building search engines, recommendation systems

    🔗 Data Sources and Attribution

    This dataset consolidates academic papers from the following sources:

    Kaggle Datasets:

    1. ArXiv Scientific Research Papers Dataset by @sumitm004
    2. Cornell University ArXiv Dataset by @Cornell-University

    Hugging Face Datasets:

    1. ML-ArXiv-Papers by @CShorten
    2. ArXiv Biology by @zeroshot
    3. ArXiv Data Extended by @wrapper228
    4. Stroke PubMed Abstracts by @Gaborandi
    5. PubMed ArXiv Abstracts Data by @brainchalov
    6. Abstracts Cleaned by @Eitanli

    🔄 Update Schedule

    This dataset represents a point-in-time consolidation. Future versions may include:

    • Additional academic sources
    • Extended fields (authors, publication dates, venues)
    • Domain-specific subsets
    • Enhanced metadata

    📄 License and Usage

    Please respect the individual licenses of the source datasets. This consolidated version is provided for research and educational purposes. When using this dataset:

    1. Citation: Please cite this dataset and acknowledge the original data sources
    2. Attribution: Credit the original dataset creators listed above
    3. Compliance: Ensure compliance with individual dataset licenses
    4. Academic Use: Primarily intended for non-commercial, academic, and research purposes

    🙏 Acknowledgments

    Special thanks to all the original dataset creators and the academic communities that make their research data publicly available. This work builds upon their valuable contributions to open science and knowledge sharing.

    Keywords: academic papers, research abstracts, NLP, machine learning, text mining, scientific literature, ArXiv, PubMed, natural language processing, research dataset

  15. arxiv-hard-negatives-cross-encoder

    • huggingface.co
    Updated Apr 20, 2025
    Cite
    Aarush (2025). arxiv-hard-negatives-cross-encoder [Dataset]. https://huggingface.co/datasets/chungimungi/arxiv-hard-negatives-cross-encoder
    Explore at:
    Dataset updated
    Apr 20, 2025
    Authors
    Aarush
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    This dataset contains hard negative examples generated using cross-encoders for training dense retrieval models. The data was used in the paper Don't Retrieve, Generate: Prompting LLMs for Synthetic Training Data in Dense Retrieval. This dataset is part of the Hugging Face collection: arxiv-hard-negatives-68027bbc601ff6cc8eb1f449 Please cite if you found this dataset useful 🤗 @misc{sinha2025dontretrievegenerateprompting, title={Don't Retrieve, Generate: Prompting LLMs for Synthetic… See the full description on the dataset page: https://huggingface.co/datasets/chungimungi/arxiv-hard-negatives-cross-encoder.

  16. Replication Data for: The economics of lost knowledge: modeling the...

    • dataverse.csuc.cat
    • portalrecerca.udl.cat
    • +1more
    application/gzip +3
    Updated Jun 20, 2025
    Cite
    Jorge Chamorro-Padial; Francisco-Javier Rodrigo-Ginés; Rosa María Rodríguez Sánchez; Rosa Maria Gil Iranzo; Roberto García González (2025). Replication Data for: The economics of lost knowledge: modeling the knowledge cost due to non-FAIR data practices [Dataset]. http://doi.org/10.34810/data2382
    Explore at:
    Available download formats: bin(142832628), application/vnd.sqlite3(81457152), txt(1917), bin(161598299), txt(6698), application/gzip(3776271875)
    Dataset updated
    Jun 20, 2025
    Dataset provided by
    CORA.Repositori de Dades de Recerca
    Authors
    Jorge Chamorro-Padial; Francisco-Javier Rodrigo-Ginés; Rosa María Rodríguez Sánchez; Rosa Maria Gil Iranzo; Roberto García González
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Dataset funded by
    Agencia Estatal de Investigación
    Description

    This file contains the replication data for the paper "The Economics of Lost Knowledge: Modeling the Knowledge Cost Due to Non-FAIR Data Practices." It includes the two networks used in the paper, arXiv and OpenAlex, a SQLite database to check whether a link is available on the internet, and the raw network as extracted from arXiv.

  17. Replication Data for: Order Book Queue Hawkes-Markovian Modeling

    • dataverse.harvard.edu
    Updated Jul 21, 2021
    Cite
    Shihao Yang (2021). Replication Data for: Order Book Queue Hawkes-Markovian Modeling [Dataset]. http://doi.org/10.7910/DVN/ZIRH84
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 21, 2021
    Dataset provided by
    Harvard Dataverse
    Authors
    Shihao Yang
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Replication Data for: Order Book Queue Hawkes-Markovian Modeling. Manuscript available at https://arxiv.org/abs/2107.09629.

  18. Data from: Accelerometer-Based Multivariate Time-Series Dataset for Calf...

    • data.niaid.nih.gov
    Updated Aug 13, 2024
    Cite
    Dissanayake, Oshana; McPherson, Sarah E.; Allyndrée, Joseph; Kennedy, Emer; Cunningham, Padraig; Riaboff, Lucile (2024). Accelerometer-Based Multivariate Time-Series Dataset for Calf Behavior Classification [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_13259481
    Explore at:
    Dataset updated
    Aug 13, 2024
    Dataset provided by
    University College Dublin
    VistaMilk SFI Research Centre, Ireland
    Authors
    Dissanayake, Oshana; McPherson, Sarah E.; Allyndrée, Joseph; Kennedy, Emer; Cunningham, Padraig; Riaboff, Lucile
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    AcTBeCalf Dataset Description

    The AcTBeCalf dataset is a comprehensive dataset designed to support the classification of pre-weaned calf behaviors from accelerometer data. It contains detailed accelerometer readings aligned with annotated behaviors, providing a valuable resource for research in multivariate time-series classification and animal behavior analysis. The dataset includes accelerometer data collected from 30 pre-weaned Holstein Friesian and Jersey calves, housed in group pens at the Teagasc Moorepark Research Farm, Ireland. Each calf was equipped with a 3D accelerometer sensor (AX3, Axivity Ltd, Newcastle, UK) sampling at 25 Hz and attached to a neck collar from one week of birth over 13 weeks.

    This dataset encompasses 27.4 hours of accelerometer data aligned with calf behaviors, including both prominent behaviors like lying, standing, and running, as well as less frequent behaviors such as grooming, social interaction, and abnormal behaviors.

    The dataset consists of a single CSV file with the following columns:

    dateTime: Timestamp of the accelerometer reading, sampled at 25 Hz.

    calfid: Identification number of the calf (1-30).

    accX: Accelerometer reading for the X axis (top-bottom direction)*.

    accY: Accelerometer reading for the Y axis (backward-forward direction)*.

    accZ: Accelerometer reading for the Z axis (left-right direction)*.

    behavior: Annotated behavior based on an ethogram of 23 behaviors.

    segId: Segment identification number associated with each accelerometer reading/row, representing all readings of the same behavior segment.

    * The directions are given relative to the position of the accelerometer sensor on the calf.

    Code Files Description

    The dataset is accompanied by several code files to facilitate the preprocessing and analysis of the accelerometer data and to support the development and evaluation of machine learning models. The main code files included in the dataset repository are:

    accelerometer_time_correction.ipynb: This notebook corrects the accelerometer time drift, ensuring the alignment of the accelerometer data with the reference time.

    shake_pattern_detector.py: This script includes an algorithm to detect shake patterns in the accelerometer signal for aligning the accelerometer time series with reference times.

    aligning_accelerometer_data_with_annotations.ipynb: This notebook aligns the accelerometer time series with the annotated behaviors based on timestamps.

    manual_inspection_ts_validation.ipynb: This notebook provides a manual inspection process for ensuring the accurate alignment of the accelerometer data with the annotated behaviors.

    additional_ts_generation.ipynb: This notebook generates additional time-series data from the original X, Y, and Z accelerometer readings, including Magnitude, ODBA (Overall Dynamic Body Acceleration), VeDBA (Vectorial Dynamic Body Acceleration), pitch, and roll.

    genSplit.py: This script provides the logic used for the generalized subject separation for machine learning model training, validation and testing.

    active_inactive_classification.ipynb: This notebook details the process of classifying behaviors into active and inactive categories using a RandomForest model, achieving a balanced accuracy of 92%.

    four_behv_classification.ipynb: This notebook employs the mini-ROCKET feature derivation mechanism and a RidgeClassifierCV to classify behaviors into four categories: drinking milk, lying, running, and other, achieving a balanced accuracy of 84%.
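As a rough illustration of the derived series named above (Magnitude, ODBA, VeDBA, pitch, and roll), the sketch below computes them from raw (accX, accY, accZ) samples using definitions that are common in the accelerometry literature. It is not taken from additional_ts_generation.ipynb: the function name, the one-second (25-sample) running-mean window used to estimate the static component, and the pitch/roll axis convention are all assumptions, and the notebook may define these quantities differently.

```python
import math


def derived_signals(samples, window=25):
    """Compute per-sample derived signals from (x, y, z) accelerometer readings.

    The static (gravity) component is estimated with a centred running mean
    over `window` samples; the dynamic component is raw minus static.
    """
    n = len(samples)
    half = window // 2
    out = []
    for i, (x, y, z) in enumerate(samples):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        count = hi - lo
        static_x = sum(s[0] for s in samples[lo:hi]) / count
        static_y = sum(s[1] for s in samples[lo:hi]) / count
        static_z = sum(s[2] for s in samples[lo:hi]) / count
        dx, dy, dz = x - static_x, y - static_y, z - static_z
        out.append({
            "magnitude": math.sqrt(x * x + y * y + z * z),
            # ODBA: sum of absolute dynamic accelerations.
            "odba": abs(dx) + abs(dy) + abs(dz),
            # VeDBA: vector norm of the dynamic accelerations.
            "vedba": math.sqrt(dx * dx + dy * dy + dz * dz),
            # One common pitch/roll convention, in degrees (an assumption here).
            "pitch": math.degrees(math.atan2(-x, math.sqrt(y * y + z * z))),
            "roll": math.degrees(math.atan2(y, z)),
        })
    return out


readings = [(0.0, 0.0, 1.0), (0.1, 0.0, 1.0), (0.0, 0.2, 0.9)]
for row in derived_signals(readings):
    print(row)
```

The running mean recomputes each window from scratch for clarity; for full-length recordings a cumulative-sum or vectorized version would be considerably faster.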

    Kindly cite one of the following papers when using this data:

    Dissanayake, O., McPherson, S. E., Allyndrée, J., Kennedy, E., Cunningham, P., & Riaboff, L. (2024). Evaluating ROCKET and Catch22 features for calf behaviour classification from accelerometer data using Machine Learning models. arXiv preprint arXiv:2404.18159.

    Dissanayake, O., McPherson, S. E., Allyndrée, J., Kennedy, E., Cunningham, P., & Riaboff, L. (2024). Development of a digital tool for monitoring the behaviour of pre-weaned calves using accelerometer neck-collars. arXiv preprint arXiv:2406.17352.

  19. Expressive Gaussian mixture models for high-dimensional statistical...

    • figshare.manchester.ac.uk
    • opendatalab.com
    txt
    Updated May 31, 2023
    Cite
    Darren Price; Stephen Menary (2023). Expressive Gaussian mixture models for high-dimensional statistical modelling: simulated data and neural network model files [Dataset]. http://doi.org/10.48420/17136839.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    May 31, 2023
    Dataset provided by
    University of Manchester
    Authors
    Darren Price; Stephen Menary
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Neural network model files and Madgraph event generator outputs used as inputs to the results presented in the paper "Learning to discover: expressive Gaussian mixture models for multi-dimensional simulation and parameter inference in the physical sciences" (arXiv:2108.11481). Code and model files can be found at: https://github.com/darrendavidprice/science-discovery/tree/master/expressive_gaussian_mixture_models

  20. Training CNNs with Low-Rank Filters for Efficient Image Classification:...

    • zenodo.org
    • data.niaid.nih.gov
    • +1more
    application/gzip, csv
    Updated Jan 24, 2020
    Cite
    Yani Ioannou (2020). Training CNNs with Low-Rank Filters for Efficient Image Classification: Trained Models [Dataset]. http://doi.org/10.5281/zenodo.53189
    Explore at:
    Available download formats: application/gzip, csv
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Yani Ioannou
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Models from experiments referenced in the paper "Training CNNs with Low-Rank Filters for Efficient Image Classification", https://arxiv.org/abs/1511.06744

    Model names differ from those in the paper, but the csv file for each set of experiments maps the paper's name for each model to the model's actual name here:

    • cifarma.csv: Network-in-Network CIFAR10 Models
    • mitma.csv: MIT Places Models
    • googlenetma.csv: GoogLeNet ILSVRC2012 Models
    • vggma.csv: VGG-11 ILSVRC2012 Models

Cite
The Devastator (2023). CCDV Arxiv Summarization Dataset [Dataset]. https://www.kaggle.com/datasets/thedevastator/ccdv-arxiv-summarization-dataset

CCDV Arxiv Summarization Dataset

Arxiv Summarization Dataset for CCDV

Explore at:
10 scholarly articles cite this dataset (View in Google Scholar)
Available download formats: zip (2219742528 bytes)
Dataset updated
Dec 5, 2023
Authors
The Devastator
License

https://creativecommons.org/publicdomain/zero/1.0/

Description

By ccdv (From Huggingface) [source]

About this dataset

The validation.csv file contains a set of articles along with their respective abstracts that can be used for validating the performance of summarization models. This subset allows researchers to fine-tune their models and measure how well they can summarize scientific texts.

The train.csv file serves as the primary training data for building summarization models. It consists of numerous articles extracted from the Arxiv database, paired with their corresponding abstracts. By utilizing this file, researchers can develop and train various machine learning algorithms to generate accurate summaries of scientific papers.

Lastly, the test.csv file provides a separate set of articles with accompanying abstracts specifically intended for evaluating the performance and effectiveness of summarization models developed using this dataset. Researchers can utilize this test set to conduct rigorous evaluations and benchmark different approaches in automatic document summarization.

Each file has two columns, article and abstract, so every full-text article is paired with its reference summary. This consistent layout makes the dataset straightforward to use when developing models for summarizing complex scientific documents.

How to use the dataset

  • Introduction:

  • File Description:

  • validation.csv: This file contains articles and their respective abstracts that can be used for validation purposes.

  • train.csv: The purpose of this file is to provide training data for summarizing scientific articles.

  • test.csv: This file includes a set of articles and their corresponding abstracts that can be used to evaluate the performance of summarization models.

  • Dataset Structure: Each file contains two columns: article, which holds the full text of a paper, and abstract, which holds its summary.

  • Usage Examples: This dataset can be utilized in various ways:

a) Training Models: You can use the train.csv file to train your own model for summarizing scientific articles from the Arxiv database. The article column provides the full text of each scientific paper, while the abstract column contains its summary.

b) Validation: The validation.csv file allows you to validate your trained models by comparing their generated summaries with the provided reference summaries in order to assess their performance.

c) Evaluation: Utilize the test.csv file as a benchmark for evaluating different summarization models. Generate summaries using your selected model and compare them with reference summaries.

  • Evaluating Performance: To measure how well your summarization model performs on this dataset, you can use metrics like ROUGE (Recall-Oriented Understudy for Gisting Evaluation). ROUGE measures overlap between generated summaries and reference summaries based on n-gram co-occurrence statistics.
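To show what n-gram overlap means in practice, here is a simplified ROUGE-N sketch in plain Python. It is an illustration only: tokenization is lowercase whitespace splitting, with none of the stemming or other preprocessing that established ROUGE implementations apply, so its scores will not match theirs exactly. The function names are chosen for this example.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """Simplified ROUGE-N: clipped n-gram overlap between two texts."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Counter intersection keeps the minimum count per n-gram (clipping).
    overlap = sum((cand & ref).values())
    cand_total = sum(cand.values())
    ref_total = sum(ref.values())
    precision = overlap / cand_total if cand_total else 0.0
    recall = overlap / ref_total if ref_total else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}


generated = "the cat sat on the mat"
reference = "the cat lay on the mat"
print(rouge_n(generated, reference, n=1))
print(rouge_n(generated, reference, n=2))
```

In this toy pair, five of the six unigrams and three of the five bigrams match. For reported results, a maintained ROUGE package is the safer choice, since preprocessing details change the scores.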


Research Ideas

  • Summarizing scientific articles: This dataset can be used to train and evaluate summarization models for the task of generating concise summaries of scientific articles from the Arxiv database. Researchers can utilize this dataset to develop novel techniques and approaches for automatic summarization in the scientific domain.
  • Information retrieval: The dataset can be used to enhance search engines or information retrieval systems by providing concise summaries along with the full text of scientific articles. This would enable users to quickly grasp key information without having to read the entire article, improving accessibility and efficiency.
  • Text generation research: Researchers interested in natural language processing and text generation can use this dataset as a benchmark for developing new models and algorithms that generate coherent, informative, and concise summaries from lengthy scientific texts. The dataset provides a diverse range of articles across various domains, allowing researchers to explore different challenges in summary generation.

Acknowledgements

If you use this dataset in your research, please credit the original authors. Data Source

License

**License: [CC0 1.0 Universal (CC0 1.0) - Public Domain...
