https://creativecommons.org/publicdomain/zero/1.0/
By ccdv (From Huggingface) [source]
The validation.csv file contains a set of articles along with their respective abstracts that can be used for validating the performance of summarization models. This subset allows researchers to fine-tune their models and measure how well they can summarize scientific texts.
The train.csv file serves as the primary training data for building summarization models. It consists of numerous articles extracted from the Arxiv database, paired with their corresponding abstracts. By utilizing this file, researchers can develop and train various machine learning algorithms to generate accurate summaries of scientific papers.
Lastly, the test.csv file provides a separate set of articles with accompanying abstracts specifically intended for evaluating the performance and effectiveness of summarization models developed using this dataset. Researchers can utilize this test set to conduct rigorous evaluations and benchmark different approaches in automatic document summarization.
Each file has two columns, article and abstract, which pair the full text of a paper with its reference summary and support the development of models for summarizing complex scientific documents.
Introduction:
File Description:
validation.csv: This file contains articles and their respective abstracts that can be used for validation purposes.
train.csv: The purpose of this file is to provide training data for summarizing scientific articles.
test.csv: This file includes a set of articles and their corresponding abstracts that can be used to evaluate the performance of summarization models.
Dataset Structure: Each file consists of two columns, article and abstract.
Usage Examples: This dataset can be utilized in various ways:
a) Training Models: You can use the train.csv file to train your own model for summarizing scientific articles from the Arxiv database. The article column provides the full text of each scientific paper, while the abstract column contains its summary.
b) Validation: The validation.csv file allows you to validate your trained models by comparing their generated summaries with the provided reference summaries in order to assess their performance.
c) Evaluation: Utilize the test.csv file as a benchmark for evaluating different summarization models. Generate summaries using your selected model and compare them with reference summaries.
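As a rough illustration of the usage pattern above, the sketch below reads (article, abstract) pairs from a CSV laid out like train.csv, validation.csv, or test.csv. It assumes only the two column names given in the description; the in-memory sample row is invented for demonstration, and `read_pairs` is a hypothetical helper name.

```python
import csv
import io


def read_pairs(handle):
    """Yield (article, abstract) tuples from a CSV that has those two columns."""
    for row in csv.DictReader(handle):
        yield row["article"], row["abstract"]


# Invented one-row sample standing in for a real file such as train.csv.
sample = io.StringIO(
    "article,abstract\n"
    '"Full text of a paper.","Short summary."\n'
)
pairs = list(read_pairs(sample))
print(pairs)
```

For a real file, the same function could be called on `open("train.csv", newline="", encoding="utf-8")`; very long articles may require raising the limit with `csv.field_size_limit`.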
- Evaluating Performance: To measure how well your summarization model performs on this dataset, you can use metrics like ROUGE (Recall-Oriented Understudy for Gisting Evaluation). ROUGE measures overlap between generated summaries and reference summaries based on n-gram co-occurrence statistics.
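To make the n-gram overlap idea concrete, here is a minimal sketch of a ROUGE-N style score. It is a simplification that assumes lowercase whitespace tokenization and no stemming, so its numbers will not match an official ROUGE implementation; the function names are illustrative only.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """Simplified ROUGE-N: clipped n-gram overlap between two strings."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # min count per shared n-gram
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


scores = rouge_n(
    "the model summarizes papers",
    "the model summarizes scientific papers",
)
print(scores)
```

Here the generated summary shares four of the five reference unigrams, so recall is 0.8 while precision is 1.0.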
- Summarizing scientific articles: This dataset can be used to train and evaluate summarization models for the task of generating concise summaries of scientific articles from the Arxiv database. Researchers can utilize this dataset to develop novel techniques and approaches for automatic summarization in the scientific domain.
- Information retrieval: The dataset can be used to enhance search engines or information retrieval systems by providing concise summaries along with the full text of scientific articles. This would enable users to quickly grasp key information without having to read the entire article, improving accessibility and efficiency.
- Text generation research: Researchers interested in natural language processing and text generation can use this dataset as a benchmark for developing new models and algorithms that generate coherent, informative, and concise summaries from lengthy scientific texts. The dataset provides a diverse range of articles across various domains, allowing researchers to explore different challenges in summary generation
If you use this dataset in your research, please credit the original authors. Data Source
**License: [CC0 1.0 Universal (CC0 1.0) - Public Domain...
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
This dataset is a combination of **arXiv** and Wikipedia data, containing metadata and textual content related to academic publications and encyclopedic knowledge. It includes key attributes such as authors, titles, digital object identifiers (DOI), categories, abstracts, update dates, URLs, and full text content of documents.
Key Features
Authors: Names of the researchers or contributors to each document.
Title: The title of the publication or article.
DOI: A unique identifier for academic papers.
Categories: Classification labels indicating the subject area (e.g., astrophysics, mathematics, physics).
**Abstract:** A summary of the document’s content.
Update Date: The most recent modification date of the document.
URL: A link to the full document.
Text: The full textual content of the document.
The dataset contains over 2.6 million unique values for text-based fields.
Use Cases
Academic research analysis: Studying trends in scientific publications over time.
Natural Language Processing (NLP): Developing models for summarization, classification, and text generation.
Knowledge extraction: Identifying key themes and topics in scientific and encyclopedic data.
Citation and impact studies: Analyzing author influence and research impact based on citations.
This dataset is a valuable resource for text mining, AI training, and scientific knowledge analysis, providing a rich blend of structured metadata and unstructured text.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
Description
unarXive is a scholarly data set containing publications' structured full-text, annotated in-text citations, linked non-text content (mathematical notation, figure/table captions) and a citation network. The data is generated from all LaTeX sources on arXiv and therefore of higher quality than data generated from PDF files. Typical uses are
Training of ML models (citation recommendation, summarization, LLMs)
Citation context analysis
Bibliographic analyses
Access
Regarding the full data set, please note the following:
Note: this Zenodo record is the "open subset" of unarXive, which contains all permissively licensed papers from arXiv.org. You can find the full version here. The code used for generating the data set is publicly available.
The ArXiv CS Papers Multi-Label Classification dataset is a comprehensive collection of research papers from the computer science domain. This dataset is intended for multi-label classification tasks and contains a diverse range of research papers spanning various topics within computer science.
The dataset consists of more than 200,000 research papers and includes the following columns:
Paper ID: A unique identifier for each research paper in the dataset.
Title: The title of the research paper.
Abstract: A brief summary or abstract of the research paper.
Year: The publication year of the research paper.
Primary Category: The primary category of the research paper, representing the main topic or area of focus.
Categories: Additional categories or subtopics associated with the research paper.
This dataset is well-suited for tasks related to text classification, topic modeling, information retrieval, and other natural language processing (NLP) tasks. Researchers and practitioners can leverage this dataset to develop and evaluate machine learning models for multi-label classification on a wide range of computer science topics.
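For multi-label classification, the category labels of each paper are usually turned into a multi-hot target vector. The sketch below shows one way to do that; it assumes the Categories field has already been parsed into a list of label strings per paper (the storage format is not stated in the description), and the three example papers are invented.

```python
def build_label_index(label_lists):
    """Map every distinct label to a fixed column position (sorted for stability)."""
    labels = sorted({label for labels in label_lists for label in labels})
    return {label: position for position, label in enumerate(labels)}


def multi_hot(labels, index):
    """Encode one paper's labels as a 0/1 vector over all known labels."""
    vector = [0] * len(index)
    for label in labels:
        vector[index[label]] = 1
    return vector


# Invented examples; real rows would come from the Categories column.
papers = [["cs.LG", "cs.CL"], ["cs.CV"], ["cs.LG", "cs.CV"]]
index = build_label_index(papers)
targets = [multi_hot(labels, index) for labels in papers]
print(index)
print(targets)
```

Each row of `targets` can then serve as the label vector for a classifier with one sigmoid output per category.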
Note: Please refer to the original ArXiv repository for access to the full-text content of the papers and proper citation guidelines. This dataset contains metadata and should be used for research and educational purposes only.
We hope that the ArXiv CS Papers Multi-Label Classification dataset serves as a valuable resource for researchers, data scientists, and machine learning enthusiasts in their quest to advance knowledge and understanding in the field of computer science.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
We include the sets of adversarial questions for each of the seven EquityMedQA datasets (OMAQ, EHAI, FBRT-Manual, FBRT-LLM, TRINDS, CC-Manual, and CC-LLM), the three other non-EquityMedQA datasets used in this work (HealthSearchQA, Mixed MMQA-OMAQ, and Omiye et al.), as well as the data generated as a part of the empirical study, including the generated model outputs (Med-PaLM 2 [1] primarily, with Med-PaLM [2] answers for pairwise analyses) and ratings from human annotators (physicians, health equity experts, and consumers). See the paper for details on all datasets.
We include other datasets evaluated in this work: HealthSearchQA [2], Mixed MMQA-OMAQ, and Omiye et al [3].
A limited number of data elements described in the paper are not included here. The following elements are excluded:
The reference answers written by physicians to HealthSearchQA questions, introduced in [2], and the set of corresponding pairwise ratings. This accounts for 2,122 rated instances.
The free-text comments written by raters during the ratings process.
Demographic information associated with the consumer raters (only age group information is included).
[1] Singhal, K., et al. Towards expert-level medical question answering with large language models. arXiv preprint arXiv:2305.09617 (2023).
[2] Singhal, K., Azizi, S., Tu, T. et al. Large language models encode clinical knowledge. Nature 620, 172–180 (2023). https://doi.org/10.1038/s41586-023-06291-2
[3] Omiye, J.A., Lester, J.C., Spichak, S. et al. Large language models propagate race-based medicine. npj Digit. Med. 6, 195 (2023). https://doi.org/10.1038/s41746-023-00939-z
[4] Abacha, Asma Ben, et al. "Overview of the medical question answering task at TREC 2017 LiveQA." TREC. 2017.
[5] Abacha, Asma Ben, et al. "Bridging the gap between consumers’ medication questions and trusted answers." MEDINFO 2019: Health and Wellbeing e-Networks for All. IOS Press, 2019. 25-29.
Independent Ratings [ratings_independent.csv]: Contains ratings of the presence of bias and its dimensions in Med-PaLM 2 outputs using the independent assessment rubric for each of the datasets studied. The primary response regarding the presence of bias is encoded in the column bias_presence with three possible values (No bias, Minor bias, Severe bias). Binary assessments of the dimensions of bias are encoded in separate columns (e.g., inaccuracy_for_some_axes). Instances for the Mixed MMQA-OMAQ dataset are triple-rated for each rater group; other datasets are single-rated. Ratings were missing for five instances in MMQA-OMAQ and two instances in CC-Manual. This file contains 7,519 rated instances.
Paired Ratings [ratings_pairwise.csv]: Contains comparisons of the presence or degree of bias and its dimensions in Med-PaLM and Med-PaLM 2 outputs for each of the datasets studied. Pairwise responses are encoded in terms of two binary columns corresponding to which of the answers was judged to contain a greater degree of bias (e.g., Med-PaLM-2_answer_more_bias). Dimensions of bias are encoded in the same way as for ratings_independent.csv. Instances for the Mixed MMQA-OMAQ dataset are triple-rated for each rater group; other datasets are single-rated. Four ratings were missing (one for EHAI, two for FBRT-Manual, one for FBRT-LLM). This file contains 6,446 rated instances.
Counterfactual Paired Ratings [ratings_counterfactual.csv]: Contains ratings under the counterfactual rubric for pairs of questions defined in the CC-Manual and CC-LLM datasets. Contains a binary assessment of the presence of bias (bias_presence), columns for each dimension of bias, and categorical columns corresponding to other elements of the rubric (ideal_answers_diff, how_answers_diff). Instances for the CC-Manual dataset are triple-rated; instances for CC-LLM are single-rated. Due to a data processing error, we removed questions that refer to "Natal" from the analysis of the counterfactual rubric on the CC-Manual dataset. This affects three questions (corresponding to 21 pairs) derived from one seed question based on the TRINDS dataset. This file contains 1,012 rated instances.
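As a small illustration of how the bias_presence column in a ratings file might be summarized, the sketch below tallies its values. Only the column name and its three possible values come from the description above; the `question_id` column and the sample rows are hypothetical stand-ins for the real file contents.

```python
import csv
import io
from collections import Counter


def count_bias_presence(handle):
    """Tally the values found in the bias_presence column of a ratings CSV."""
    return Counter(row["bias_presence"] for row in csv.DictReader(handle))


# Invented rows; a real run would open ratings_independent.csv instead.
sample = io.StringIO(
    "question_id,bias_presence\n"
    "q1,No bias\n"
    "q2,Minor bias\n"
    "q3,No bias\n"
)
counts = count_bias_presence(sample)
print(counts)
```

Because the Mixed MMQA-OMAQ instances are triple-rated, a fuller analysis would likely aggregate ratings per question before comparing datasets.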
Open-ended Medical Adversarial Queries (OMAQ) [equitymedqa_omaq.csv]: Contains questions that compose the OMAQ dataset. The OMAQ dataset was first described in [1].
Equity in Health AI (EHAI) [equitymedqa_ehai.csv]: Contains questions that compose the EHAI dataset.
Failure-Based Red Teaming - Manual (FBRT-Manual) [equitymedqa_fbrt_manual.csv]: Contains questions that compose the FBRT-Manual dataset.
Failure-Based Red Teaming - LLM (FBRT-LLM); full [equitymedqa_fbrt_llm.csv]: Contains questions that compose the extended FBRT-LLM dataset.
Failure-Based Red Teaming - LLM (FBRT-LLM) [equitymedqa_fbrt_llm_661_sampled.csv]: Contains questions that compose the sampled FBRT-LLM dataset used in the empirical study.
TRopical and INfectious DiseaseS (TRINDS) [equitymedqa_trinds.csv]: Contains questions that compose the TRINDS dataset.
Counterfactual Context - Manual (CC-Manual) [equitymedqa_cc_manual.csv]: Contains pairs of questions that compose the CC-Manual dataset.
Counterfactual Context - LLM (CC-LLM) [equitymedqa_cc_llm.csv]: Contains pairs of questions that compose the CC-LLM dataset.
HealthSearchQA [other_datasets_healthsearchqa.csv]: Contains questions sampled from the HealthSearchQA dataset [1,2].
Mixed MMQA-OMAQ [other_datasets_mixed_mmqa_omaq]: Contains questions that compose the Mixed MMQA-OMAQ dataset.
Omiye et al. [other datasets_omiye_et_al]: Contains questions proposed in Omiye et al. [3].
Version 2: Updated to include ratings and generated model outputs. Dataset files were updated to include unique ids associated with each question.
Version 1: Contained datasets of questions without ratings. Consistent with v1 available as a preprint on arXiv (https://arxiv.org/abs/2403.12025).
WARNING: These datasets contain adversarial questions designed specifically to probe biases in AI systems. They can include human-written and model-generated language and content that may be inaccurate, misleading, biased, disturbing, sensitive, or offensive.
NOTE: the content of this research repository (i) is not intended to be a medical device; and (ii) is not intended for clinical use of any kind, including but not limited to diagnosis or prognosis.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
Data supporting the paper entitled "Beyond designer’s knowledge: Generating materials design hypotheses via a large language model". For detailed information about the file contents, see 'Folder Contents Description.pdf'.
Provider is willing to license its rights in the Prompts and Codes (“Provider’s Rights”) to academic researchers to use free of charge solely for academic, non-commercial research purposes subject to the terms and conditions outlined herein. The Prompts and Codes were created at the University of Wisconsin ("UW") by Quanliang Liu and Hyunseok Oh. Please note Provider's Rights may include, but are not limited to, certain patents or patent applications owned by the Wisconsin Alumni Research Foundation (“WARF”).
https://arxiv.org/abs/2409.06756
https://choosealicense.com/licenses/pddl/
arXiv Figures Dataset
This dataset contains image-text pairs extracted from figures from papers published until the end of 2020 in the arXiv repository. The dataset can be used to train CLIP models. This repo contains a Parquet file containing the metadata of a WebDataset in img2dataset format. The images themselves are not distributed and need to be retrieved. Note that the images cannot be retrieved by an HTTP URL, so img2dataset cannot be used as is to retrieve the data. Instead… See the full description on the dataset page: https://huggingface.co/datasets/nopperl/arxiv-image-text.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
Supplementary Data
Status of the scalar singlet dark matter model arXiv:1705.07931
The files in this record contain data for the scalar singlet dark matter model considered in the GAMBIT "Round 1" scalar singlet paper.
The files consist of
Three YAML files, each corresponding to a different parameter range
StandardModel_SLHA2_SingletDM_scan_15.yaml, a universal YAML fragment included from the other three YAML files
Three hdf5 files. SingletDM.hdf5 contains the combined results of all sampling runs, and is the basis for the profile likelihood plots in the paper. SingletDM_TW_full.hdf5 and SingletDM_TW_lowmass.hdf5 contain the results from T-Walk scans over the full and low-mass parameter ranges, respectively. These are the bases for the marginalised posterior plots in the paper.
An example pip file corresponding to each hdf5 file, for producing plots using pippi
A tarball best_fits_yaml.tar.gz containing YAML files of the best-fit point in each subregion of the fit.
The YAML files corresponding to different parameter ranges follow the naming scheme SingletDM_[slice].yaml, where slice may be full, lowmass or neck. Each of these YAML files contains entries in the Scanners node for running Diver, MultiNest, TWalk and GreAT.
A few caveats to keep in mind:
The YAML files that we give here are updated compared to the ones that we used when generating the hdf5 file, in order to match the set of available options in the release version of GAMBIT 1.0.0. The included physics and numerics are however identical.
The YAML files are designed to work with the tagged release of GAMBIT 1.0.0, and the pip file is tested with pippi 2.0, commit 2ab061a8. They may or may not work with later versions of either software (but you can of course always obtain the version that they do work with via the git history).
The pip file is an example only. Users wishing to reproduce the more advanced plots in any of the GAMBIT papers should contact us for tips or scripts, or experiment for themselves. Many of these scripts are in multiple parts and require undocumented manual interventions and steps in order to implement various plot-specific customisations, so please don't expect the same level of polish as for files provided here or in the GAMBIT repo.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
Data supporting the paper entitled "Extracting Accurate Materials Data from Research Papers with Conversational Language Models and Prompt Engineering" by Maciej P. Polak and Dane Morgan
https://arxiv.org/abs/2303.05352
BulkModulus_test_database_MPPolak_DMorgan.xlsx - dataset of bulk modulus text passages and sentences used for methods assessment.
CriticalCoolingRates_MGs_database_MPPolak_DMorgan.xlsx - a database of critical cooling rates of metallic glasses. The data is presented in three versions (described in detail in the paper): "raw", "cleaned", and "standardized". The critical cooling rate database additionally includes manually extracted data serving as ground truth for tests, in sheets labeled "manual". In addition, a "standardized_MG" database is included, which limits the results to metallic glasses only, together with "standardized_tables_MG" for values extracted from tables, and "Figure_Classification", which contains figure numbers, captions, and DOIs of their source documents.
YieldStrength_HEAs_database_MPPolak_DMorgan.xlsx - a database of yield strengths in the context of high entropy alloys. The data is presented in three versions (described in detail in the paper): "raw", "cleaned", and "standardized". In addition, a "standardized_HEA" database is included, which limits the results to HEAs only, together with "standardized_tables_HEA" for values extracted from tables, and "Figure_Classification", which contains figure numbers, captions, and DOIs of their source documents.
ChatExtract_Code_MPPolak_DMorgan.zip - These files contain the ChatExtract code with a short example and instructions.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
Data, Codes, and Supplementary Figures for "Leveraging Vision Capabilities of Multimodal LLMs for Automated Data Extraction from Plots"
https://arxiv.org/abs/2503.12326
This repository contains datasets and tools related to PlotExtract, a pipeline for automated plot digitization using LLM-based vision models. Below is a description of the key components:
Dataset Output Files
*.out_data - Results of LLM-based visual data extraction from plot images. These files contain the extracted data points in CSV-like format.
*.out_code - Python code generated by the LLM to recreate the source plot using the extracted data.
*.out_conversation - Full conversations with the LLM conducted by PlotExtract, including prompts and responses.
interpolated_* - Visual and statistical comparisons based on interpolation between the LLM-extracted data and the ground truth. These correspond to the interpolation accuracy assessments described in the paper.
pointwise_* - Visual and statistical comparisons on a point-by-point basis between extracted and ground-truth data. These correspond to pointwise accuracy evaluations from the main text.
*.stats - Numerical summaries of extraction accuracy, referenced in the associated visual comparisons.
*.csv - Manually extracted ground truth data used as reference for evaluating extraction accuracy.
All of the above files are generated automatically during PlotExtract execution.
Published, Synthetic, and chartQA Dataset
The Published Dataset does not include original plot images due to copyright restrictions. Instead, each plot is referenced in source_images.csv, which lists:
DOI of the source publication
Figure number
Filename used in this dataset
The Synthetic Dataset includes synthetic plot images, extracted data, generated replots, and evaluation outputs for benchmarking purposes.
The chartQA Dataset (https://doi.org/10.48550/arXiv.2203.10244) includes chartQA plot images, extracted data, generated replots, and evaluation outputs for benchmarking purposes.
There are two equivalent datasets: FULL and CROPPED, the first one containing original images and the second one containing images cropped as much as possible to preserve the plot only and remove additional text.
Codes
All source code, including PlotExtract and supporting scripts for evaluation and comparison, is included in MPPolak_DMorgan_PlotExtract_Codes.zip.
Each script contains usage instructions in-line and is intended to be self-explanatory for users familiar with Python-based data processing workflows.
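The idea of an interpolation-based comparison between extracted and ground-truth points can be pictured with the sketch below. It is not the PlotExtract evaluation code: the choice of linear interpolation, the clamping at the ends, and the mean absolute error metric are all assumptions made for illustration, and the data points are invented.

```python
from bisect import bisect_left


def interpolate(xs, ys, x):
    """Linearly interpolate y at x from sorted points; clamp outside the range."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    i = bisect_left(xs, x)
    x0, x1 = xs[i - 1], xs[i]
    y0, y1 = ys[i - 1], ys[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


def mean_abs_error(extracted, truth):
    """Compare extracted (x, y) points with ground truth at the truth's x values."""
    xs, ys = zip(*sorted(extracted))
    errors = [abs(interpolate(xs, ys, x) - y) for x, y in truth]
    return sum(errors) / len(errors)


extracted = [(0.0, 0.0), (2.0, 4.0)]  # invented LLM-extracted points
truth = [(1.0, 2.5)]                  # invented ground-truth point
mae = mean_abs_error(extracted, truth)
print(mae)
```

A pointwise comparison would instead match each extracted point to its nearest ground-truth point, which matters when the two sets sample the curve at different x positions.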
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
One of the large language models trained in this paper: https://arxiv.org/abs/1810.10045
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
Summary of citation networks arXiv-HepTh and arXiv-HepPh.
This dataset is a curated subset of the original arXiv dataset, each entry enriched with a 256-dimensional embedding vector. The embeddings are generated using OpenAI's "text-embedding-3-small" model. For each data point, the embedding is created by concatenating the text of the title, author(s), and abstract into a single string, which is then processed by the embedding model. This approach captures the semantic essence of each document, facilitating tasks such as similarity search… See the full description on the dataset page: https://huggingface.co/datasets/MongoDB/subset_arxiv_papers_with_embeddings.
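Similarity search over stored embeddings typically ranks documents by cosine similarity to a query vector. The following is a minimal sketch of that ranking step; it uses invented two-dimensional vectors in place of the 256-dimensional embeddings, and the helper names are illustrative rather than part of the dataset or of any library.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar(query, corpus, k=2):
    """Return the titles of the k corpus entries closest to the query vector."""
    ranked = sorted(
        corpus,
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [title for title, _ in ranked[:k]]


# Invented toy vectors standing in for real embeddings.
corpus = [
    ("paper A", [1.0, 0.0]),
    ("paper B", [0.0, 1.0]),
    ("paper C", [0.9, 0.1]),
]
top = most_similar([1.0, 0.0], corpus)
print(top)
```

A linear scan like this is fine for small subsets; larger collections usually rely on a vector index instead.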
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
This dataset is a comprehensive collection of over 3 million research paper titles and abstracts, curated and consolidated from multiple high-quality academic sources. The dataset provides a unified, clean, and standardized format for researchers, data scientists, and machine learning practitioners working on natural language processing, academic research analysis, and knowledge discovery tasks.
title and abstract columns
| Metric | Value |
|---|---|
| Total Records | ~3,000,000+ |
| Columns | 2 (title, abstract) |
| File Size | 4.15 GB |
| Format | CSV |
| Duplicates | Removed |
| Missing Values | Removed |
cleaned_papers.csv
├── title (string): Scientific paper title
└── abstract (string): Scientific paper abstract
The dataset underwent a rigorous cleaning and standardization process:
title and abstract format
This dataset is ideal for:
This dataset consolidates academic papers from the following sources:
This dataset represents a point-in-time consolidation. Future versions may include:
- Additional academic sources
- Extended fields (authors, publication dates, venues)
- Domain-specific subsets
- Enhanced metadata
Please respect the individual licenses of the source datasets. This consolidated version is provided for research and educational purposes. When using this dataset:
🙏 Acknowledgments
Special thanks to all the original dataset creators and the academic communities that make their research data publicly available. This work builds upon their valuable contributions to open science and knowledge sharing.
Keywords: academic papers, research abstracts, NLP, machine learning, text mining, scientific literature, ArXiv, PubMed, natural language processing, research dataset
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
This dataset contains hard negative examples generated using cross-encoders for training dense retrieval models. The data was used in the paper Don't Retrieve, Generate: Prompting LLMs for Synthetic Training Data in Dense Retrieval. This dataset is part of the Hugging Face collection: arxiv-hard-negatives-68027bbc601ff6cc8eb1f449 Please cite if you found this dataset useful 🤗 @misc{sinha2025dontretrievegenerateprompting, title={Don't Retrieve, Generate: Prompting LLMs for Synthetic… See the full description on the dataset page: https://huggingface.co/datasets/chungimungi/arxiv-hard-negatives-cross-encoder.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
This file contains the replication data for the paper "The Economics of Lost Knowledge: Modeling the Knowledge Cost Due to Non-FAIR Data Practices." It includes the two networks used in the paper, arXiv and OpenAlex, a SQLite database to check whether a link is available on the internet, and the raw network as extracted from arXiv.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
Replication Data for: Order Book Queue Hawkes-Markovian Modeling. Manuscript available at https://arxiv.org/abs/2107.09629.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
AcTBeCalf Dataset Description
The AcTBeCalf dataset is a comprehensive dataset designed to support the classification of pre-weaned calf behaviors from accelerometer data. It contains detailed accelerometer readings aligned with annotated behaviors, providing a valuable resource for research in multivariate time-series classification and animal behavior analysis. The dataset includes accelerometer data collected from 30 pre-weaned Holstein Friesian and Jersey calves, housed in group pens at the Teagasc Moorepark Research Farm, Ireland. Each calf was equipped with a 3D accelerometer sensor (AX3, Axivity Ltd, Newcastle, UK) sampling at 25 Hz and attached to a neck collar from one week of birth over 13 weeks.
This dataset encompasses 27.4 hours of accelerometer data aligned with calf behaviors, including both prominent behaviors like lying, standing, and running, as well as less frequent behaviors such as grooming, social interaction, and abnormal behaviors.
The dataset consists of a single CSV file with the following columns:
dateTime: Timestamp of the accelerometer reading, sampled at 25 Hz.
calfid: Identification number of the calf (1-30).
accX: Accelerometer reading for the X axis (top-bottom direction)*.
accY: Accelerometer reading for the Y axis (backward-forward direction)*.
accZ: Accelerometer reading for the Z axis (left-right direction)*.
behavior: Annotated behavior based on an ethogram of 23 behaviors.
segId: Segment identification number associated with each accelerometer reading/row, representing all readings of the same behavior segment.
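Series such as magnitude, ODBA, and VeDBA are commonly derived from three-axis readings like accX, accY, and accZ. The sketch below uses the usual textbook definitions and estimates the static (gravity) component with a centered running mean; the 25-sample window (one second at 25 Hz) is an assumption, and this is not the code shipped with the dataset.

```python
import math


def running_mean(values, window):
    """Centered running mean; the window shrinks near the ends of the series."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def derived_signals(acc_x, acc_y, acc_z, window=25):
    """Compute magnitude, ODBA, and VeDBA for each reading."""
    static = [running_mean(axis, window) for axis in (acc_x, acc_y, acc_z)]
    rows = []
    for i in range(len(acc_x)):
        raw = (acc_x[i], acc_y[i], acc_z[i])
        # Dynamic acceleration = raw reading minus the static estimate.
        dynamic = [raw[k] - static[k][i] for k in range(3)]
        rows.append({
            "magnitude": math.sqrt(sum(v * v for v in raw)),
            "odba": sum(abs(v) for v in dynamic),
            "vedba": math.sqrt(sum(v * v for v in dynamic)),
        })
    return rows


# A motionless sensor reading 1 g on one axis: no dynamic acceleration.
signals = derived_signals([0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [1.0, 1.0, 1.0])
print(signals[0])
```

Pitch and roll are left out here because their formulas depend on the sensor's axis orientation, which would need to be checked against the dataset's documentation.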
Code Files Description
The dataset is accompanied by several code files to facilitate the preprocessing and analysis of the accelerometer data and to support the development and evaluation of machine learning models. The main code files included in the dataset repository are:
accelerometer_time_correction.ipynb: This notebook corrects the accelerometer time drift, ensuring the alignment of the accelerometer data with the reference time.
shake_pattern_detector.py: This script includes an algorithm to detect shake patterns in the accelerometer signal for aligning the accelerometer time series with reference times.
aligning_accelerometer_data_with_annotations.ipynb: This notebook aligns the accelerometer time series with the annotated behaviors based on timestamps.
manual_inspection_ts_validation.ipynb: This notebook provides a manual inspection process for ensuring the accurate alignment of the accelerometer data with the annotated behaviors.
additional_ts_generation.ipynb: This notebook generates additional time-series data from the original X, Y, and Z accelerometer readings, including Magnitude, ODBA (Overall Dynamic Body Acceleration), VeDBA (Vectorial Dynamic Body Acceleration), pitch, and roll.
genSplit.py: This script provides the logic used for the generalized subject separation for machine learning model training, validation and testing.
active_inactive_classification.ipynb: This notebook details the process of classifying behaviors into active and inactive categories using a RandomForest model, achieving a balanced accuracy of 92%.
four_behv_classification.ipynb: This notebook employs the mini-ROCKET feature derivation mechanism and a RidgeClassifierCV to classify behaviors into four categories: drinking milk, lying, running, and other, achieving a balanced accuracy of 84%.
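Subject-wise separation means that all readings from one calf land in exactly one of the training, validation, or test sets, so a model is never evaluated on an animal it has seen. The sketch below illustrates that idea with a seeded shuffle of calf IDs; the 60/20/20 fractions and the function name are assumptions, and the logic in genSplit.py may differ.

```python
import random


def split_by_subject(subject_ids, fractions=(0.6, 0.2, 0.2), seed=0):
    """Assign each subject to exactly one of train, validation, or test."""
    subjects = sorted(set(subject_ids))
    random.Random(seed).shuffle(subjects)  # seeded for reproducibility
    n_train = round(fractions[0] * len(subjects))
    n_val = round(fractions[1] * len(subjects))
    return {
        "train": set(subjects[:n_train]),
        "validation": set(subjects[n_train:n_train + n_val]),
        "test": set(subjects[n_train + n_val:]),
    }


# Calf IDs 1-30, matching the calfid range given in the column description.
splits = split_by_subject(range(1, 31))
print({name: len(ids) for name, ids in splits.items()})
```

Rows of the CSV would then be routed to a set according to which split contains their calfid.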
Kindly cite one of the following papers when using this data:
Dissanayake, O., McPherson, S. E., Allyndrée, J., Kennedy, E., Cunningham, P., & Riaboff, L. (2024). Evaluating ROCKET and Catch22 features for calf behaviour classification from accelerometer data using Machine Learning models. arXiv preprint arXiv:2404.18159.
Dissanayake, O., McPherson, S. E., Allyndrée, J., Kennedy, E., Cunningham, P., & Riaboff, L. (2024). Development of a digital tool for monitoring the behaviour of pre-weaned calves using accelerometer neck-collars. arXiv preprint arXiv:2406.17352.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Neural network model files and MadGraph event generator outputs used as inputs to the results presented in the paper "Learning to discover: expressive Gaussian mixture models for multi-dimensional simulation and parameter inference in the physical sciences", arXiv:2108.11481. Code and model files can be found at: https://github.com/darrendavidprice/science-discovery/tree/master/expressive_gaussian_mixture_models
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Models from experiments referenced in the paper "Training CNNs with Low-Rank Filters for Efficient Image Classification", https://arxiv.org/abs/1511.06744
Model names differ from those in the paper; the CSV file for each set of experiments maps the paper's name for each model to the name used here.
https://creativecommons.org/publicdomain/zero/1.0/
By ccdv (from Hugging Face)
The validation.csv file contains a set of articles along with their respective abstracts that can be used for validating the performance of summarization models. This subset allows researchers to fine-tune their models and measure how well they can summarize scientific texts.
The train.csv file serves as the primary training data for building summarization models. It consists of articles extracted from arXiv, each paired with its abstract. Researchers can use this file to train machine learning models that generate summaries of scientific papers.
Lastly, the test.csv file provides a separate set of articles with accompanying abstracts specifically intended for evaluating the performance and effectiveness of summarization models developed using this dataset. Researchers can utilize this test set to conduct rigorous evaluations and benchmark different approaches in automatic document summarization.
All three files use the same two columns, article and abstract, pairing the full text of each paper with its reference summary.
Introduction:
File Description:
validation.csv: This file contains articles and their respective abstracts that can be used for validation purposes.
train.csv: The purpose of this file is to provide training data for summarizing scientific articles.
test.csv: This file includes a set of articles and their corresponding abstracts that can be used to evaluate the performance of summarization models.
Dataset Structure: Each of the three files has the same two columns: article (the full text of the paper) and abstract (its reference summary).
Usage Examples: This dataset can be utilized in various ways:
a) Training Models: You can use the train.csv file to train your own model for summarizing scientific articles from arXiv. The article column provides the full text of each scientific paper, while the abstract column contains its summary.
b) Validation: The validation.csv file allows you to validate your trained models by comparing their generated summaries with the provided reference summaries in order to assess their performance.
c) Evaluation: Utilize the test.csv file as a benchmark for evaluating different summarization models. Generate summaries using your selected model and compare them with reference summaries.
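As a rough illustration of the three uses above, the splits can be read with Python's standard csv module. This is a minimal sketch that assumes each file has a header row with article and abstract columns, as described here; the helper name read_pairs and its limit parameter are hypothetical.

```python
import csv

def read_pairs(path, limit=None):
    """Yield (article, abstract) tuples from one of the CSV splits."""
    # Full-text articles can exceed the csv module's default field size
    # limit, so raise it before reading.
    csv.field_size_limit(2**31 - 1)
    with open(path, newline="", encoding="utf-8") as handle:
        for index, row in enumerate(csv.DictReader(handle)):
            if limit is not None and index >= limit:
                break
            yield row["article"], row["abstract"]
```

For example, `for article, abstract in read_pairs("train.csv", limit=5): ...` would iterate over the first five training pairs without loading the whole file into memory.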
- Evaluating Performance: To measure how well your summarization model performs on this dataset, you can use metrics like ROUGE (Recall-Oriented Understudy for Gisting Evaluation). ROUGE measures overlap between generated summaries and reference summaries based on n-gram co-occurrence statistics.
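The n-gram overlap behind ROUGE-N can be sketched in a few lines. This is a simplified illustration only: it assumes lowercased whitespace tokenization with no stemming, and the function names are hypothetical. Scores from established ROUGE implementations may differ, so use one of those for reported results.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, reference, n=1):
    """Return ROUGE-N precision, recall and F1 for one summary pair."""
    cand = ngram_counts(candidate.lower().split(), n)
    ref = ngram_counts(reference.lower().split(), n)
    # Clipped overlap: each n-gram counts at most as often as it
    # appears in the other text.
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

Recall is the fraction of reference n-grams that the generated summary recovers; precision is the fraction of generated n-grams that appear in the reference.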
- Summarizing scientific articles: This dataset can be used to train and evaluate models that generate concise summaries of scientific articles from arXiv. Researchers can use it to develop new techniques for automatic summarization in the scientific domain.
- Information retrieval: The dataset can be used to enhance search engines or information retrieval systems by providing concise summaries along with the full text of scientific articles. This would enable users to quickly grasp key information without having to read the entire article, improving accessibility and efficiency.
- Text generation research: Researchers interested in natural language processing and text generation can use this dataset as a benchmark for developing new models and algorithms that generate coherent, informative, and concise summaries from lengthy scientific texts. The dataset provides a diverse range of articles across various domains, allowing researchers to explore different challenges in summary generation.
If you use this dataset in your research, please credit the original authors.
**License: [CC0 1.0 Universal (CC0 1.0) - Public Domain...