100+ datasets found
  1. PMOA-CITE Dataset

    • paperswithcode.com
    • figshare.com
    Updated May 19, 2024
    Cite
    Tong Zeng; Daniel E. Acuna (2024). PMOA-CITE Dataset [Dataset]. https://paperswithcode.com/dataset/pmoa-cite
    Explore at:
    Dataset updated
    May 19, 2024
    Authors
    Tong Zeng; Daniel E. Acuna
    Description

    The dataset used in the experiments of the paper "Modeling citation worthiness by using attention‑based bidirectional long short‑term memory networks and interpretable models".

    There are one million sentences in total, split into training, validation and test sets at 60%, 20% and 20%, respectively.

    For the pre-processing of the dataset, please refer to the paper.

    The data are stored in JSONL format (each row is a JSON object); a couple of rows are listed as examples: {"sec_name":"introduction","cur_sent_id":"12213838@0#3$0","next_sent_id":"12213838@0#3$1","cur_sent":"All three spectrin subunits are essential for normal development.","next_sent":"βH, encoded by the karst locus, is an essential protein that is required for epithelial morphogenesis .","cur_scaled_len_features":{"type":1,"values":[0.17716535433070865,0.13513513513513514]},"next_scaled_len_features":{"type":1,"values":[0.32677165354330706,0.35135135135135137]},"cur_has_citation":0,"next_has_citation":1}

    {"sec_name":"results","prev_sent_id":"12230634@1@1#0$2","cur_sent_id":"12230634@1@1#0$3","next_sent_id":"12230634@1@1#0$4","prev_sent":"μIU/ml at the 2.0-h postprandial time point.","cur_sent":"Statistically significant differences between the mean plasma insulin levels of dogs treated with 50 mg/kg of GSNO, and those treated with 50 mg/kg GSNO and vitamin C (50 mg/kg) were observed at the 1.0-h and 1.5-h time points (P < 0.05).","next_sent":"The mean plasma insulin concentrations in the dogs treated with 50 mg/kg of vitamin C and 50 mg/kg of GSNO, or 50 mg/kg of GSNO was significantly altered compared to those of controls or captopril-treated dogs (P < 0.05).","prev_scaled_len_features":{"type":1,"values":[0.09448818897637795,0.08108108108108109]},"cur_scaled_len_features":{"type":1,"values":[0.8582677165354331,1.0]},"next_scaled_len_features":{"type":1,"values":[0.7913385826771654,0.9459459459459459]},"prev_has_citation":0,"cur_has_citation":0,"next_has_citation":0}

    {"sec_name":"results","prev_sent_id":"12213837@1@0#3$3","cur_sent_id":"12213837@1@0#3$4","next_sent_id":"12213837@1@0#3$5","prev_sent":"Cleavage of VAMP2 by BoNT/D releases the NH2-terminal 59 amino acids from the protein and eliminates exocytosis.","cur_sent":"However, in this case, exocytosis cannot be recovered by addition of the cleaved fragment .","next_sent":"Peptides that exactly correspond to the BoNT/D cleavage site (VAMP2 aa 25–59 and 60–94-cys) were equally efficient at mediating liposome fusion (unpublished data).","prev_scaled_len_features":{"type":1,"values":[0.36220472440944884,0.35135135135135137]},"cur_scaled_len_features":{"type":1,"values":[0.2795275590551181,0.2972972972972973]},"next_scaled_len_features":{"type":1,"values":[0.562992125984252,0.5135135135135135]},"prev_has_citation":0,"cur_has_citation":1,"next_has_citation":0}

    For code that uses this dataset to model citation worthiness, see https://github.com/sciosci/cite-worthiness
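
    As a minimal loading sketch (the file name "train.jsonl" below is a hypothetical placeholder for one of the splits), the records can be streamed with the standard json module:

```python
import json

def iter_sentences(path):
    """Yield (sentence, has_citation) pairs from a PMOA-CITE JSONL file."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            record = json.loads(line)
            # Every row carries the current sentence and its citation label;
            # prev_/next_ fields are present only when neighbouring sentences exist.
            yield record["cur_sent"], record["cur_has_citation"]

# Estimate the share of citation-worthy sentences in one split.
labels = [label for _, label in iter_sentences("train.jsonl")]  # placeholder path
print(f"{sum(labels) / len(labels):.3f} of sentences carry a citation")
```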

  2. POCI CSV dataset of all the citation data

    • figshare.com
    zip
    Updated Dec 27, 2022
    + more versions
    Cite
    OpenCitations (2022). POCI CSV dataset of all the citation data [Dataset]. http://doi.org/10.6084/m9.figshare.21776351.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    Dec 27, 2022
    Dataset provided by
    Figshare: http://figshare.com/
    Authors
    OpenCitations
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset contains all the citation data (in CSV format) included in POCI, released on 27 December 2022. In particular, each line of the CSV file defines a citation, and includes the following information:

    • [field "oci"] the Open Citation Identifier (OCI) for the citation;
    • [field "citing"] the PMID of the citing entity;
    • [field "cited"] the PMID of the cited entity;
    • [field "creation"] the creation date of the citation (i.e. the publication date of the citing entity);
    • [field "timespan"] the time span of the citation (i.e. the interval between the publication date of the cited entity and the publication date of the citing entity);
    • [field "journal_sc"] whether the citation is a journal self-citation (i.e. the citing and the cited entities are published in the same journal);
    • [field "author_sc"] whether the citation is an author self-citation (i.e. the citing and the cited entities have at least one author in common).

    This version of the dataset contains:

    717,654,703 citations; 26,024,862 bibliographic resources.

    The size of the zipped archive is 9.6 GB, while the size of the unzipped CSV file is 50 GB. Additional information about POCI is available on its official webpage.
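
    A minimal sketch for streaming the dump without unzipping it first; the archive name "poci.zip" is a placeholder, and the "yes"/"no" encoding of the self-citation flags is an assumption:

```python
import csv
import io
import zipfile

def count_self_citations(archive_path):
    """Tally citations and self-citations across every CSV in the POCI zip."""
    totals = {"citations": 0, "journal_sc": 0, "author_sc": 0}
    with zipfile.ZipFile(archive_path) as zf:
        for name in zf.namelist():
            if not name.endswith(".csv"):
                continue
            with zf.open(name) as raw:
                reader = csv.DictReader(io.TextIOWrapper(raw, encoding="utf-8"))
                for row in reader:
                    totals["citations"] += 1
                    # Assumption: self-citation flags are stored as "yes"/"no".
                    totals["journal_sc"] += row["journal_sc"] == "yes"
                    totals["author_sc"] += row["author_sc"] == "yes"
    return totals

print(count_self_citations("poci.zip"))  # "poci.zip" is a placeholder path
```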

  3. Citation Knowledge with Section and Context

    • ordo.open.ac.uk
    zip
    Updated May 5, 2020
    Cite
    Anita Khadka (2020). Citation Knowledge with Section and Context [Dataset]. http://doi.org/10.21954/ou.rd.11346848.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    May 5, 2020
    Dataset provided by
    The Open University
    Authors
    Anita Khadka
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    This dataset contains information from scientific publications written by authors who have published papers at the RecSys conference. It contains four files with information extracted from scientific publications. The details of each file are explained below (a loading sketch follows this list):

    i) all_authors.tsv: This file contains the details of authors who published research papers at the RecSys conference. The details include authors' identifiers in various forms (such as number, ORCID iD, DBLP URL, DBLP key and Google Scholar URL), authors' first name, last name and their affiliation (where they work).

    ii) all_publications.tsv: This file contains the details of publications authored by the authors listed in the all_authors.tsv file (please note the list does not contain all of those authors' publications; refer to the paper for further details). The details include publications' identifiers in different forms (such as number, DBLP key, DBLP URL, Google Scholar URL), title, filtered title, published date, published conference and paper abstract.

    iii) selected_author_publications-information.tsv: This file consists of identifiers of authors and their publications. Here, we provide the information on the selected authors and publications used for our experiment.

    iv) selected_publication_citations-information.tsv: This file contains the information on the selected publications, covering both citing and cited papers used in our experiment. It consists of the identifier of the citing paper, identifier of the cited paper, citation title, citation filtered title, the sentence before the citation, the citing sentence, the sentence after the citation, and the citation position (section). Please note it does not contain information on all citations cited in the publications. For more detail, please refer to the paper.

    This dataset is for research purposes only. If you use this dataset, please cite our paper "Capturing and exploiting citation knowledge for recommending recently published papers", due to be published in the Web2Touch track 2020 (not yet published).
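
    A minimal loading sketch with pandas, assuming the four TSV files sit in the working directory; the exact column headers are not reproduced here and may need adjusting:

```python
import pandas as pd

# Hypothetical local paths; adjust to wherever the archive was extracted.
authors = pd.read_csv("all_authors.tsv", sep="\t")
publications = pd.read_csv("all_publications.tsv", sep="\t")
author_pubs = pd.read_csv("selected_author_publications-information.tsv", sep="\t")
citations = pd.read_csv("selected_publication_citations-information.tsv", sep="\t")

# Basic sanity check: sizes of the four tables.
for name, frame in [("authors", authors), ("publications", publications),
                    ("author_pubs", author_pubs), ("citations", citations)]:
    print(name, frame.shape)
```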

  4. Citations to software and data in Zenodo via open sources

    • zenodo.org
    • data.niaid.nih.gov
    csv
    Updated Jan 24, 2020
    Cite
    Stephanie van de Sandt; Stephanie van de Sandt; Alex Ioannidis; Alex Ioannidis; Lars Holm Nielsen; Lars Holm Nielsen (2020). Citations to software and data in Zenodo via open sources [Dataset]. http://doi.org/10.5281/zenodo.3482927
    Explore at:
    Available download formats: csv
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Stephanie van de Sandt; Stephanie van de Sandt; Alex Ioannidis; Alex Ioannidis; Lars Holm Nielsen; Lars Holm Nielsen
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    In January 2019, the Asclepias Broker harvested citation links to Zenodo objects from three discovery systems: the NASA Astrophysics Datasystem (ADS), Crossref Event Data and Europe PMC. Each row of our dataset represents one unique link between a citing publication and a Zenodo DOI. Both endpoints are described by basic metadata. The second dataset contains usage metrics for every cited Zenodo DOI of our data sample.

  5. DBLP Dataset

    • paperswithcode.com
    Updated Apr 13, 2021
    Cite
    Jie Tang; Jing Zhang; Limin Yao; Juanzi Li; Li Zhang; Zhong Su (2021). DBLP Dataset [Dataset]. https://paperswithcode.com/dataset/dblp
    Explore at:
    Dataset updated
    Apr 13, 2021
    Authors
    Jie Tang; Jing Zhang; Limin Yao; Juanzi Li; Li Zhang; Zhong Su
    Description

    DBLP is a citation network dataset. The citation data are extracted from DBLP, ACM, MAG (Microsoft Academic Graph), and other sources. The first version contains 629,814 papers and 632,752 citations. Each paper is associated with an abstract, authors, year, venue, and title. The dataset can be used for clustering with network and side information, studying influence in the citation network, finding the most influential papers, topic modeling analysis, etc.

  6. Data from: Citeseer Dataset

    • paperswithcode.com
    • huggingface.co
    Updated Mar 4, 2007
    + more versions
    Cite
    C. Lee Giles; Kurt D. Bollacker; Steve Lawrence (2007). Citeseer Dataset [Dataset]. https://paperswithcode.com/dataset/citeseer
    Explore at:
    Dataset updated
    Mar 4, 2007
    Authors
    C. Lee Giles; Kurt D. Bollacker; Steve Lawrence
    Description

    The CiteSeer dataset consists of 3312 scientific publications classified into one of six classes. The citation network consists of 4732 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 3703 unique words.
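
    A minimal loading sketch, assuming the commonly distributed plain-text release with a citeseer.content file (paper id, 3703 binary word features and a class label per tab-separated line) and a citeseer.cites file (one citation link per line); this layout is an assumption, so adjust if your copy differs:

```python
import numpy as np

features, labels = [], []
with open("citeseer.content", encoding="utf-8") as fh:  # assumed file name
    for line in fh:
        parts = line.rstrip("\n").split("\t")
        # parts[0]: paper id, parts[1:-1]: 3703 binary word features,
        # parts[-1]: class label (one of six classes).
        features.append([int(v) for v in parts[1:-1]])
        labels.append(parts[-1])

X = np.array(features)
print(X.shape, sorted(set(labels)))  # expected shape: (3312, 3703), 6 classes

edges = [tuple(line.split()) for line in open("citeseer.cites", encoding="utf-8")]
print(len(edges), "citation links")  # expected: 4732
```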

  7. Data from: Data reuse and the open data citation advantage

    • data.niaid.nih.gov
    • search.dataone.org
    • +2more
    zip
    Updated Oct 1, 2013
    Cite
    Heather A. Piwowar; Todd J. Vision (2013). Data reuse and the open data citation advantage [Dataset]. http://doi.org/10.5061/dryad.781pv
    Explore at:
    Available download formats: zip
    Dataset updated
    Oct 1, 2013
    Dataset provided by
    National Evolutionary Synthesis Center
    Authors
    Heather A. Piwowar; Todd J. Vision
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Background: Attribution to the original contributor upon reuse of published data is important both as a reward for data creators and to document the provenance of research findings. Previous studies have found that papers with publicly available datasets receive a higher number of citations than similar studies without available data. However, few previous analyses have had the statistical power to control for the many variables known to predict citation rate, which has led to uncertain estimates of the "citation benefit". Furthermore, little is known about patterns in data reuse over time and across datasets.

    Method and Results: Here, we look at citation rates while controlling for many known citation predictors, and investigate the variability of data reuse. In a multivariate regression on 10,555 studies that created gene expression microarray data, we found that studies that made data available in a public repository received 9% (95% confidence interval: 5% to 13%) more citations than similar studies for which the data was not made available. Date of publication, journal impact factor, open access status, number of authors, first and last author publication history, corresponding author country, institution citation history, and study topic were included as covariates. The citation benefit varied with date of dataset deposition: a citation benefit was most clear for papers published in 2004 and 2005, at about 30%. Authors published most papers using their own datasets within two years of their first publication on the dataset, whereas data reuse papers published by third-party investigators continued to accumulate for at least six years. To study patterns of data reuse directly, we compiled 9,724 instances of third party data reuse via mention of GEO or ArrayExpress accession numbers in the full text of papers. The level of third-party data use was high: for 100 datasets deposited in year 0, we estimated that 40 papers in PubMed reused a dataset by year 2, 100 by year 4, and more than 150 data reuse papers had been published by year 5. Data reuse was distributed across a broad base of datasets: a very conservative estimate found that 20% of the datasets deposited between 2003 and 2007 had been reused at least once by third parties.

    Conclusion: After accounting for other factors affecting citation rate, we find a robust citation benefit from open data, although a smaller one than previously reported. We conclude there is a direct effect of third-party data reuse that persists for years beyond the time when researchers have published most of the papers reusing their own data. Other factors that may also contribute to the citation benefit are considered. We further conclude that, at least for gene expression microarray data, a substantial fraction of archived datasets are reused, and that the intensity of dataset reuse has been steadily increasing since 2003.

  8. MultiCite Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Jun 30, 2021
    Cite
    Anne Lauscher; Brandon Ko; Bailey Kuehl; Sophie Johnson; David Jurgens; Arman Cohan; Kyle Lo (2021). MultiCite Dataset [Dataset]. https://paperswithcode.com/dataset/multicite
    Explore at:
    Dataset updated
    Jun 30, 2021
    Authors
    Anne Lauscher; Brandon Ko; Bailey Kuehl; Sophie Johnson; David Jurgens; Arman Cohan; Kyle Lo
    Description

    MultiCite is a dataset of 12,653 citation contexts from over 1,200 computational linguistics papers, used for citation context analysis (CCA). MultiCite contains multi-sentence, multi-label citation contexts within full paper texts.

  9. Data from: OpCitance: Citation contexts identified from the PubMed Central...

    • databank.illinois.edu
    Updated Feb 15, 2023
    Cite
    Tzu-Kun Hsiao; Vetle Torvik (2023). OpCitance: Citation contexts identified from the PubMed Central open access articles [Dataset]. http://doi.org/10.13012/B2IDB-4353270_V1
    Explore at:
    Dataset updated
    Feb 15, 2023
    Authors
    Tzu-Kun Hsiao; Vetle Torvik
    Dataset funded by
    U.S. National Institutes of Health (NIH)
    Description

    Sentences and citation contexts identified from the PubMed Central open access articles

    The dataset is delivered as 24 tab-delimited text files. The files contain 720,649,608 sentences, 75,848,689 of which are citation contexts. The dataset is based on a snapshot of articles in the XML version of the PubMed Central open access subset (i.e., the PMCOA subset). The PMCOA subset was collected in May 2019. The dataset is created as described in: Hsiao TK., & Torvik V. I. (manuscript) OpCitance: Citation contexts identified from the PubMed Central open access articles.

    Files: Each *_journal_IntxtCit.tsv file contains sentences and citation contexts identified from articles published in journals with journal titles starting with the indicated letter(s):
    • A_journal_IntxtCit.tsv – A
    • B_journal_IntxtCit.tsv – B
    • C_journal_IntxtCit.tsv – C
    • D_journal_IntxtCit.tsv – D
    • E_journal_IntxtCit.tsv – E
    • F_journal_IntxtCit.tsv – F
    • G_journal_IntxtCit.tsv – G
    • H_journal_IntxtCit.tsv – H
    • I_journal_IntxtCit.tsv – I
    • J_journal_IntxtCit.tsv – J
    • K_journal_IntxtCit.tsv – K
    • L_journal_IntxtCit.tsv – L
    • M_journal_IntxtCit.tsv – M
    • N_journal_IntxtCit.tsv – N
    • O_journal_IntxtCit.tsv – O
    • P_p1_journal_IntxtCit.tsv – P (part 1)
    • P_p2_journal_IntxtCit.tsv – P (part 2)
    • Q_journal_IntxtCit.tsv – Q
    • R_journal_IntxtCit.tsv – R
    • S_journal_IntxtCit.tsv – S
    • T_journal_IntxtCit.tsv – T
    • UV_journal_IntxtCit.tsv – U or V
    • W_journal_IntxtCit.tsv – W
    • XYZ_journal_IntxtCit.tsv – X, Y or Z

    Each row in these files is a sentence/citation context and contains the following columns:
    • pmcid: PMCID of the article.
    • pmid: PMID of the article. If an article does not have a PMID, the value is NONE.
    • location: The article component (abstract, main text, table, figure, etc.) to which the citation context/sentence belongs.
    • IMRaD: The type of IMRaD section associated with the citation context/sentence. I, M, R, and D represent introduction/background, method, results, and conclusion/discussion, respectively; NoIMRaD indicates that the section type is not identifiable.
    • sentence_id: The ID of the citation context/sentence in the article component.
    • total_sentences: The number of sentences in the article component.
    • intxt_id: The ID of the citation.
    • intxt_pmid: PMID of the citation (as tagged in the XML file). If a citation does not have a PMID tagged in the XML file, the value is "-".
    • intxt_pmid_source: The sources where the intxt_pmid can be identified. "xml" means that the PMID is only identified from the XML file; "xml,pmc" means that the PMID is not only from the XML file, but also in the citation data collected from the NCBI Entrez Programming Utilities. If a citation does not have an intxt_pmid, the value is "-".
    • intxt_mark: The citation marker associated with the inline citation.
    • best_id: The best source link ID (e.g., PMID) of the citation.
    • best_source: The sources that confirm the best ID.
    • best_id_diff: The comparison result between the best_id column and the intxt_pmid column.
    • citation: A citation context. If no citation is found in a sentence, the value is the sentence.
    • progression: Text progression of the citation context/sentence.

    Supplementary Files:
    • PMC-OA-patci.tsv.gz – This file contains the best source link IDs for the references (e.g., PMID). Patci [1] was used to identify the best source link IDs. The best source link IDs are mapped to the citation contexts and displayed in the *_journal_IntxtCit.tsv files as the best_id column. Each row in the PMC-OA-patci.tsv.gz file is a citation (i.e., a reference extracted from the XML file) and contains the following columns:
      • pmcid: PMCID of the citing article.
      • pos: The citation's position in the reference list.
      • fromPMID: PMID of the citing article.
      • toPMID: Source link ID (e.g., PMID) of the citation. This ID is identified by Patci.
      • SRC: The sources that confirm the toPMID.
      • MatchDB: The origin bibliographic database of the toPMID.
      • Probability: The match probability of the toPMID.
      • toPMID2: PMID of the citation (as tagged in the XML file).
      • SRC2: The sources that confirm the toPMID2.
      • intxt_id: The ID of the citation.
      • journal: The first letter of the journal title. This maps to the *_journal_IntxtCit.tsv files.
      • same_ref_string: Whether the citation string appears in the reference list more than once.
      • DIFF: The comparison result between the toPMID column and the toPMID2 column.
      • bestID: The best source link ID (e.g., PMID) of the citation.
      • bestSRC: The sources that confirm the best ID.
      • Match: Matching result produced by Patci.
    • Supplementary_File_1.zip – This file contains the code for generating the dataset.

    [1] Agarwal, S., Lincoln, M., Cai, H., & Torvik, V. (2014). Patci – a tool for identifying scientific articles cited by patents. GSLIS Research Showcase 2014. http://hdl.handle.net/2142/54885
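
    A minimal sketch for scanning one of the *_journal_IntxtCit.tsv files in chunks with pandas; the file name is a placeholder, and treating rows with a non-"-" intxt_id as citation contexts is an assumption about the encoding:

```python
import pandas as pd

path = "A_journal_IntxtCit.tsv"  # placeholder; any of the per-letter files works

sentences = 0
contexts = 0
for chunk in pd.read_csv(path, sep="\t", dtype=str, chunksize=500_000):
    sentences += len(chunk)
    # Assumption: rows whose intxt_id is not the "-" placeholder are citation contexts.
    contexts += (chunk["intxt_id"] != "-").sum()

print(f"{contexts} citation contexts out of {sentences} sentences in {path}")
```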

  10. Citing online references

    • borealisdata.ca
    • dataone.org
    Updated May 7, 2019
    Cite
    David Topps; Corey Wirun; Nishan Sharma (2019). Citing online references [Dataset]. http://doi.org/10.5683/SP2/80VX7U
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    May 7, 2019
    Dataset provided by
    Borealis
    Authors
    David Topps; Corey Wirun; Nishan Sharma
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Citation of reference material is well established for most traditional sources but remains inconsistent in its application for online resources such as web pages, blog posts and materials generated from underlying database queries. We present some tips on how authors can more effectively cite and archive such resources so they are persistent and sustainable.

  11. CITE Dataset

    • paperswithcode.com
    • opendatalab.com
    Cite
    Malihe Alikhani; Sreyasi Nag Chowdhury; Gerard de Melo; Matthew Stone, CITE Dataset [Dataset]. https://paperswithcode.com/dataset/cite
    Explore at:
    Authors
    Malihe Alikhani; Sreyasi Nag Chowdhury; Gerard de Melo; Matthew Stone
    Description

    CITE is a crowd-sourced resource for multimodal discourse: this resource characterises inferences in image-text contexts in the domain of cooking recipes in the form of coherence relations.

  12. Uncovering the Citation Landscape: Exploring OpenCitations COCI,...

    • data.niaid.nih.gov
    Updated Sep 7, 2023
    + more versions
    Cite
    Lorenzo Paolini (2023). Uncovering the Citation Landscape: Exploring OpenCitations COCI, OpenCitations Meta, and ERIH-PLUS in Social Sciences and Humanities Journals - DATA PRODUCED [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7974815
    Explore at:
    Dataset updated
    Sep 7, 2023
    Dataset provided by
    Marta Soricetti
    Lorenzo Paolini
    Olga Pagnotta
    Sara Vellone
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    These zipped folders contain all the data produced for the research "Uncovering the Citation Landscape: Exploring OpenCitations COCI, OpenCitations Meta, and ERIH-PLUS in Social Sciences and Humanities Journals": the result datasets (dataset_map_disciplines, dataset_no_SSH, dataset_SSH, erih_meta_with_disciplines and erih_meta_without_disciplines).

    dataset_map_disciplines.zip contains CSV files with four columns ("id", "citing", "cited", "disciplines") giving information about publications stored in OpenCitations META (version 3, released in February 2023) that belong to SSH journals according to ERIH PLUS (version downloaded on 2023-04-27), specifying the disciplines associated with them and boolean values stating whether they cite or are cited, according to the OpenCitations COCI dataset (version 19, released in January 2023).

    dataset_no_SSH.zip and dataset_SSH.zip contain CSV files with the same structure. Each dataset has four columns: "citing", "is_citing_SSH", "cited", and "is_cited_SSH". The "citing" and "cited" columns are filled with DOIs of publications stored in OpenCitations META that, according to OpenCitations COCI, are involved in a citation. The "is_citing_SSH" and "is_cited_SSH" columns contain boolean values: "True" if the corresponding publication is associated with an SSH (Social Sciences and Humanities) discipline, according to ERIH PLUS, and "False" otherwise. The two datasets are built starting from the two different subsets obtained as a result of the union between OpenCitations META and ERIH PLUS: dataset_SSH comes from erih_meta_with_disciplines and dataset_no_SSH from erih_meta_without_disciplines. erih_meta_with_disciplines.zip and erih_meta_without_disciplines.zip, as explained before, contain CSV files originating from ERIH PLUS and META. erih_meta_without_disciplines has just one column, "id", and contains the DOIs of all the publications in META that do not have any associated discipline, that is, have not been published in an SSH journal, while erih_meta_with_disciplines derives from all the publications in META that have at least one linked discipline and has two columns: "id" and "erih_disciplines", the latter containing a string with all the disciplines linked to that publication, such as "History, Interdisciplinary research in the Humanities, Interdisciplinary research in the Social Sciences, Sociology".
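
    A minimal sketch for inspecting one of the CSV files inside dataset_SSH.zip with pandas; the file name is a placeholder, and the boolean flags are read as the strings "True"/"False" as described above:

```python
import pandas as pd

ssh = pd.read_csv("dataset_SSH_part1.csv", dtype=str)  # placeholder file name

both_ssh = (ssh["is_citing_SSH"] == "True") & (ssh["is_cited_SSH"] == "True")
outgoing = (ssh["is_citing_SSH"] == "True") & (ssh["is_cited_SSH"] == "False")
print(f"within-SSH citations: {both_ssh.mean():.1%}, "
      f"SSH citing non-SSH: {outgoing.mean():.1%}")
```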

    Software: https://doi.org/10.5281/zenodo.8326023

    Data preprocessed: https://doi.org/10.5281/zenodo.7973159

    Article: https://zenodo.org/record/8326044

    DMP: https://zenodo.org/record/8324973

    Protocol: https://doi.org/10.17504/protocols.io.n92ldpeenl5b/v5

  13. Data from: InboVeg - NICHE-Vlaanderen groundwater related vegetation relevés...

    • gbif.org
    • data.europa.eu
    Updated May 4, 2021
    + more versions
    Cite
    Els De Bie; Dimitri Brosens; Els De Bie; Dimitri Brosens (2021). InboVeg - NICHE-Vlaanderen groundwater related vegetation relevés for Flanders, Belgium [Dataset]. http://doi.org/10.15468/gouexm
    Explore at:
    Dataset updated
    May 4, 2021
    Dataset provided by
    Global Biodiversity Information Facility: https://www.gbif.org/
    Research Institute for Nature and Forest (INBO)
    Authors
    Els De Bie; Dimitri Brosens; Els De Bie; Dimitri Brosens
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Time period covered
    May 21, 2002 - Jul 7, 2005
    Area covered
    Description

    The NICHE-Vlaanderen project aimed to develop a hydro-ecological prediction model for use in ecological impact assessment studies. The data in this dataset are part of the vegetation-plot data used to feed the model and contain groundwater-dependent terrestrial vegetation relevés in relation to groundwater levels. Vegetation plot relevés were performed near selected piezometers (WATINA database, groundwater network Flanders) between May and August in 2002, 2004 and 2005. Initially the vegetation surveys were recorded in Turboveg (Hennekens, 1998) and later moved to INBOVEG, the INBO vegetation plot database. The dataset contains 569 vegetation relevés, recorded during the fieldwork of the NICHE-Vlaanderen project. Relevés contain species coverage data, coverage data for layers, vegetation height and the date of recording. All the vegetation relevés were classified as vegetation types. Issues related to the dataset can be submitted here: https://github.com/inbo/data-publication/tree/master/datasets/inboveg-niche-vlaanderen-events

    To allow anyone to use this dataset, we have released the data to the public domain under a Creative Commons Zero waiver (http://creativecommons.org/publicdomain/zero/1.0/). We would appreciate it, however, if you read and follow these norms for data use (http://www.inbo.be/en/norms-for-data-use) and provide a link to the original dataset (https://doi.org/10.15468/gouexm) whenever possible. If you use these data for a scientific paper, please cite the dataset following the applicable citation norms and/or consider us for co-authorship. We are always interested to know how you have used or visualized the data, or to provide more information, so please contact us via the contact information provided in the metadata, opendata@inbo.be or https://twitter.com/LifeWatchINBO.

  14. Methodology data of "A qualitative and quantitative citation analysis toward...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Aug 2, 2024
    Cite
    Peroni, Silvio (2024). Methodology data of "A qualitative and quantitative citation analysis toward retracted articles: a case of study" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4024337
    Explore at:
    Dataset updated
    Aug 2, 2024
    Dataset provided by
    Heibi, Ivan
    Peroni, Silvio
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This document contains the datasets and visualizations generated after applying the methodology defined in our work "A qualitative and quantitative citation analysis toward retracted articles: a case of study". The methodology defines a citation analysis of the Wakefield et al. [1] retracted article from a quantitative and qualitative point of view. The data contained in this repository are based on the first two steps of the methodology. The first step of the methodology (i.e. "Data gathering") builds an annotated dataset of the citing entities; this step is also discussed at length in [2]. The second step (i.e. "Topic Modelling") runs a topic modeling analysis on the textual features contained in the dataset generated by the first step.

    Note: the data are all contained inside the "method_data.zip" file. You need to unzip the file to get access to all the files and directories listed below.

    Data gathering

    The data generated by this step are stored in "data/":

    "cits_features.csv": a dataset containing all the entities (rows in the CSV) which have cited the Wakefield et al. retracted article, and a set of features characterizing each citing entity (columns in the CSV). The features included are: DOI ("doi"), year of publication ("year"), the title ("title"), the venue identifier ("source_id"), the title of the venue ("source_title"), yes/no value in case the entity is retracted as well ("retracted"), the subject area ("area"), the subject category ("category"), the sections of the in-text citations ("intext_citation.section"), the value of the reference pointer ("intext_citation.pointer"), the in-text citation function ("intext_citation.intent"), the in-text citation perceived sentiment ("intext_citation.sentiment"), and a yes/no value to denote whether the in-text citation context mentions the retraction of the cited entity ("intext_citation.section.ret_mention"). Note: this dataset is licensed under a Creative Commons public domain dedication (CC0).

    "cits_text.csv": this dataset stores the abstract ("abstract") and the in-text citations context ("intext_citation.context") for each citing entity identified using the DOI value ("doi"). Note: the data keep their original license (the one provided by their publisher). This dataset is provided in order to favor the reproducibility of the results obtained in our work.

    Topic modelling

    We run a topic modeling analysis on the textual features gathered (i.e. abstracts and citation contexts). The results are stored inside the "topic_modeling/" directory. The topic modeling has been done using MITAO, a tool for mashing up automatic text analysis tools and creating a completely customizable visual workflow [3]. The topic modeling results for each textual feature are separated into two different folders, "abstracts/" for the abstracts and "intext_cit/" for the in-text citation contexts. Both directories contain the following directories/files:

    "mitao_workflows/": the workflows of MITAO. These are JSON files that could be reloaded in MITAO to reproduce the results following the same workflows.

    "corpus_and_dictionary/": it contains the dictionary and the vectorized corpus given as inputs for the LDA topic modeling.

    "coherence/coherence.csv": the coherence score of several topic models trained on a number of topics from 1 - 40.

    "datasets_and_views/": the datasets and visualizations generated using MITAO.

    References

    Wakefield, A., Murch, S., Anthony, A., Linnell, J., Casson, D., Malik, M., Berelowitz, M., Dhillon, A., Thomson, M., Harvey, P., Valentine, A., Davies, S., & Walker-Smith, J. (1998). RETRACTED: Ileal-lymphoid-nodular hyperplasia, non-specific colitis, and pervasive developmental disorder in children. The Lancet, 351(9103), 637–641. https://doi.org/10.1016/S0140-6736(97)11096-0

    Heibi, I., & Peroni, S. (2020). A methodology for gathering and annotating the raw-data/characteristics of the documents citing a retracted article v1 (protocols.io.bdc4i2yw) [Data set]. In protocols.io. ZappyLab, Inc. https://doi.org/10.17504/protocols.io.bdc4i2yw

    Ferri, P., Heibi, I., Pareschi, L., & Peroni, S. (2020). MITAO: A User Friendly and Modular Software for Topic Modelling [JD]. PuntOorg International Journal, 5(2), 135–149. https://doi.org/10.19245/25.05.pij.5.2.3
    
  15. Data from: Dataset for 'A Matter of Culture? Conceptualising and...

    • zenodo.org
    csv
    Updated Oct 22, 2024
    Cite
    Rhodri Leng; Rhodri Leng; Justyna Bandola-Gill; Justyna Bandola-Gill; Katherine Smith; Katherine Smith; Valerie Pattyn; Valerie Pattyn; Niklas Andersen; Niklas Andersen (2024). Dataset for 'A Matter of Culture? Conceptualising and Investigating 'Evidence Cultures' within Research on Evidence-Informed Policymaking' [Dataset]. http://doi.org/10.5281/zenodo.13972074
    Explore at:
    Available download formats: csv
    Dataset updated
    Oct 22, 2024
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Rhodri Leng; Rhodri Leng; Justyna Bandola-Gill; Justyna Bandola-Gill; Katherine Smith; Katherine Smith; Valerie Pattyn; Valerie Pattyn; Niklas Andersen; Niklas Andersen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Oct 22, 2024
    Description

    Introduction
    This document describes the data collection and datasets used in the manuscript "A Matter of Culture? Conceptualising and Investigating ‘Evidence Cultures’ within Research on Evidence-Informed Policymaking" [1].

    Data Collection

    To construct the citation network analysed in the manuscript, we first designed a series of queries to capture a large sample of literature exploring the relationship between evidence, policy, and culture from various perspectives. Our team of domain experts developed the following queries based on terms common in the literature. These queries search for the terms in the titles, abstracts, and associated keywords of WoS indexed records (i.e. 'TS='). While they are separated below for ease of reading, they were combined into a single query via the OR operator in our search. Our search was conducted on the Web of Science (WoS) Core Collection through the University of Edinburgh Library subscription on 29/11/2023, returning a total of 2,089 records.

    TS = ((“cultures of evidence” OR “culture of evidence” OR “culture of knowledge” OR “cultures of knowledge” OR “research culture” OR “research cultures” OR “culture of research” OR “cultures of research” OR “epistemic culture” OR “epistemic cultures” OR “epistemic community” OR “epistemic communities” OR “epistemic infrastructure” OR “evaluation culture” OR “evaluation cultures” OR “culture of evaluation” OR “cultures of evaluation” OR “thought style” OR “thought styles” OR “thought collective” OR “thought collectives” OR “knowledge regime” OR “knowledge regimes” OR “knowledge system” OR “knowledge systems” OR “civic epistemology” OR “civic epistemologies”) AND (“policy” OR “policies” OR “policymaking” OR “policy making” OR “policymaker” OR “policymakers” OR “policy maker” OR “policy makers” OR “policy decision” OR “policy decisions” OR “political decision” OR “political decisions” OR “political decision making”))

    OR

    TS = ((“culture” OR “cultures”) AND ((“evidence-based” OR “evidence-informed” OR “evidence-led” OR “science-based” OR “science-informed” OR “science-led” OR “research-based” OR “research-informed” OR “evidence use” OR “evidence user” OR “evidence utilisation” OR “evidence utilization” OR “research use” OR “researcher user” OR “research utilisation” OR “research utilization” OR “research in” OR “evidence in” OR “science in”) NEAR/1 (“policymaking” OR “policy making” OR “policy maker” OR “policy makers”)))

    OR

    TS = ((“culture” OR “cultures”) AND (“scientific advice” OR “technical advice” OR “scientific expertise” OR “technical expertise” OR “expert advice”) AND (“policy” OR “policies” OR “policymaking” OR “policy making” OR “policymaker” OR “policymakers” OR “policy maker” OR “policy makers” OR “political decision” OR “political decisions” OR “political decision making”))

    OR

    TS = ((“culture” OR “cultures”) AND (“post-normal science” OR “trans-science” OR “transdisciplinary” OR “transdisiplinarity” OR “science-policy interface” OR “policy sciences” OR “sociology of knowledge” OR “sociology of science” OR “knowledge transfer” OR “knowledge translation” OR “knowledge broker” OR “implementation science” OR “risk society”) AND (“policymaking” OR “policy making” OR “policymaker” OR “policymakers” OR “policy maker” OR “policy makers”))

    Citation Network Construction

    All bibliographic metadata on these 2,089 records were downloaded in five batches in plain text and then merged in R. We then parsed these data into network readable files. All unique reference strings are given unique node IDs. A node-attribute-list (‘CE_Node’) links identifying information of each document with its node ID, including authors, title, year of publication, journal WoS ID, and WoS citations. An edge-list (‘CE_Edge’) records all citations from these documents to their bibliographies – with edges going from a citing document to the cited – using the relevant node IDs. These data were then cleaned by (a) matching DOIs for reference strings that differ but point to the same paper, and (b) manual merging of obvious duplicates caused by referencing errors.

    Our initial dataset consisted of 2,089 retrieved documents and 123,772 unretrieved cited documents (i.e. documents that were cited within the publications we retrieved but which were not one of these 2,089 documents). These documents were connected by 157,229 citation links, but ~87% of the documents in the network were cited just once. To focus on relevant literature, we filtered the network to include only documents with at least three citation or reference links. We further refined the dataset by focusing on the main connected component, resulting in 6,650 nodes and 29,198 edges. It is this dataset that we publish here, and it is this network that underpins Figure 1, Table 1, and the qualitative examination of documents (see manuscript for further details).

    Our final network dataset contains 1,819 of the documents in our original query (~87% of the original retrieved records), and 4,831 documents not retrieved via our Web of Science search but cited by at least three of the retrieved documents. We then clustered this network by modularity maximization via the Leiden algorithm [2], detecting 14 clusters with Q=0.59. Citations to documents within the same cluster constitute ~77% of all citations in the network.

    Citation Network Dataset Description

    We include two network datasets: (i) ‘CE_Node.csv’, which contains 1,819 retrieved documents and 4,831 unretrieved referenced documents, making for a total of 6,650 documents (nodes); (ii) ‘CE_Edge.csv’, which records citations (edges) between the documents (nodes), including a total of 29,198 citation links. These files can be used to construct a network with many different tools, but we have formatted them to be used in Gephi 0.10 [3]; a loading sketch using networkx appears after the column descriptions below.

    ‘CE_Node.csv’ is a comma-separated values file that contains two types of nodes:

    i. Retrieved documents – these are documents captured by our query. These include full bibliographic metadata and reference lists.

    ii. Non-retrieved documents – these are documents referenced by our retrieved documents but were not retrieved via our query. These only have data contained within their reference string (i.e. first author, journal or book title, year of publication, and possibly DOI).

    The columns in the .csv refer to:

    - Id, the node ID

    - Label, the reference string of the document

    - DOI, the DOI for the document, if available

    - WOS_ID, WoS accession number

    - Authors, named authors

    - Title, title of document

    - Document_type, variable indicating whether a document is an article, review, etc.

    - Journal_book_title, journal of publication or title of book

    - Publication year, year of publication.

    - WOS_times_cited, total Core Collection citations as of 29/11/2023

    - Indegree, number of within network citations to a given document

    - Cluster, provides the cluster membership number as discussed in the manuscript (Figure 1)

    ‘CE_Edge.csv’ is a comma-separated values file that contains edges (citation links) between nodes (documents) (n=29,198). The columns refer to:

    - Source, node ID of the citing document

    - Target, node ID of the cited document
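
    As mentioned above, a minimal sketch for loading the two files into a directed graph with pandas and networkx (an alternative to Gephi; column names follow the descriptions above):

```python
import networkx as nx
import pandas as pd

nodes = pd.read_csv("CE_Node.csv")
edges = pd.read_csv("CE_Edge.csv")

# Directed citation network: edges run from the citing to the cited document.
graph = nx.from_pandas_edgelist(edges, source="Source", target="Target",
                                create_using=nx.DiGraph)

# Attach node attributes (label, cluster, indegree, ...) keyed by the "Id" column.
nx.set_node_attributes(graph, nodes.set_index("Id").to_dict("index"))

print(graph.number_of_nodes(), graph.number_of_edges())  # expected: 6650, 29198
```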

    Cluster Analysis

    We qualitatively analyse a set of publications from seven of the largest clusters in our manuscript.

  16. ACL-ARC dataset

    • figshare.com
    zip
    Updated May 30, 2023
    Cite
    TONG ZENG (2023). ACL-ARC dataset [Dataset]. http://doi.org/10.6084/m9.figshare.12573872.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    May 30, 2023
    Dataset provided by
    Figshare: http://figshare.com/
    Authors
    TONG ZENG
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset used in the experiments of the paper "Modeling citation worthiness by using attention‑based bidirectional long short‑term memory networks and interpretable models". For the pre-processing of the dataset, please refer to Bonab et al., 2018 (http://doi.org/10.1145/3209978.3210162). We downloaded a copy of that dataset and adjusted some fields. The data are stored in JSONL format (each row is a JSON object); one row is listed as an example:

    {"cur_sent":"the nespole uses a client server architecture to allow a common user who is initially browsing through the web pages of a service provider on the internet to connect seamlessly to a human agent of the service provider who speaks another language and provides speech to speech translation service between the two parties","cur_scaled_len_features":{"type":1,"values":[0.06936542669584245,0.07202216066481995]},"cur_has_citation":1}

    For code that uses this dataset to model citation worthiness, see https://github.com/sciosci/cite-worthiness
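
    A minimal sketch of how the JSONL fields map onto a classification task, using a simple bag-of-words baseline with scikit-learn (this is not the attention-based BiLSTM from the paper, and the file name "acl_arc.jsonl" is a placeholder):

```python
import json

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

texts, labels = [], []
with open("acl_arc.jsonl", encoding="utf-8") as fh:  # placeholder file name
    for line in fh:
        record = json.loads(line)
        texts.append(record["cur_sent"])
        labels.append(record["cur_has_citation"])

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=0)

vectorizer = CountVectorizer()
clf = LogisticRegression(max_iter=1000)
clf.fit(vectorizer.fit_transform(X_train), y_train)
print(f"baseline accuracy: {clf.score(vectorizer.transform(X_test), y_test):.3f}")
```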

  17. Dataset for Machine Learning Assisted Citation Screening for Systematic...

    • data.niaid.nih.gov
    Updated Dec 22, 2023
    Cite
    Dhrangadhariya, Anjani (2023). Dataset for Machine Learning Assisted Citation Screening for Systematic Reviews [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10423426
    Explore at:
    Dataset updated
    Dec 22, 2023
    Dataset provided by
    Müller, Henning
    Hilfiker, Roger
    Dhrangadhariya, Anjani
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The work "Machine Learning Assisted Citation Screening for Systematic Reviews" explored the problem of citation screening automation using machine-learning (ML) with an aim to accelerate the process of generating systematic reviews. Manual process of citation screening involve two reviewers manually screening the searched studies using a predefined inclusion criteria. If the study passes the "inclusion" criteria, it is included for further analysis or is excluded. As apparant through manual screening process, the work considered citation screening as a binary classification problem whereby any ML classifier could be trained to separate the searched studies into these two classes (include and exclude).

    A physiotherapy citation screening dataset was used to test automation approaches; it includes the studies identified for citation screening in an update to the systematic review by Hilfiker et al. The dataset comprises titles and abstracts (citations) from 31,279 (deduplicated: 25,540) studies identified during the search phase of this SR. These studies were already manually assessed for relevance and assigned one of two mutually exclusive labels by two reviewers. The uploaded file consists of 25,540 data samples, one per line. It is a tab-separated file and the data in it are structured as shown below. This dataset was manually labelled into include and exclude by Hilfiker et al.

    Columns: Title, PMID, Abstract, Class, MeSH terms (separated by a pipe)

    Structured exercise improves physical functioning in women with stages I and II breast cancer: results of a randomized controlled trial.
    11157015 Abstract PURPOSE: Self-directed and supervised exercise were compared with usual care in a clinical trial designed to evaluate the effect of structured exercise on physical functioning and other dimensions of health-related quality of life in women with stages I and II breast cancer. PATIENTS AND METHODS: One hundred twenty-three women with stages I and II breast cancer completed baseline evaluations of generic and disease- and site-specific health-related quality of life, aerobic capacity, and body weight. Participants were randomly allocated to one of three intervention groups: usual care (control group), self-directed exercise, or supervised exercise. Quality of life, aerobic capacity, and body weight measures were repeated at 26 weeks... include or exclude Clinical Trial | Comparative Study | Randomized Controlled Trial | Research Support, Non-U.S. Gov't | Antineoplastic Combined Chemotherapy Protocols | Breast Neoplasms | Breast Neoplasms | Breast Neoplasms | Chemotherapy, Adjuvant | Exercise | Female | Humans | Middle Aged | Neoplasm Staging | Quality of Life | Radiotherapy, Adjuvant
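
    A minimal loading sketch with pandas, assuming the file has no header row (drop names= if it does) and that MeSH terms use the pipe separator described above; the file name is a placeholder:

```python
import pandas as pd

columns = ["title", "pmid", "abstract", "label", "mesh_terms"]
df = pd.read_csv("citation_screening.tsv", sep="\t", names=columns, dtype=str)

# Split the pipe-separated MeSH terms into lists and inspect the class balance.
df["mesh_terms"] = df["mesh_terms"].fillna("").apply(
    lambda cell: [term.strip() for term in cell.split("|") if term.strip()])
print(df["label"].value_counts())  # include vs. exclude
```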

    If you use this dataset in your research, please cite our papers.

  18. Data from: BIP! NDR (NoDoiRefs): a dataset of citations from papers without...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Nov 22, 2024
    + more versions
    Cite
    Vergoulis, Thanasis (2024). BIP! NDR (NoDoiRefs): a dataset of citations from papers without DOIs in computer science conferences and workshops [Dataset]. https://data.niaid.nih.gov/resources?id=ZENODO_7962019
    Explore at:
    Dataset updated
    Nov 22, 2024
    Dataset provided by
    Chatzopoulos, Serafeim
    Vergoulis, Thanasis
    Tryfonopoulos, Christos
    Koloveas, Paris
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    In the field of Computer Science, conference and workshop papers serve as important contributions, carrying substantial weight in research assessment processes compared to other disciplines. However, a considerable number of these papers are not assigned a Digital Object Identifier (DOI), hence their citations are not reported in widely used citation datasets like OpenCitations and Crossref, which limits citation analysis. While the Microsoft Academic Graph (MAG) previously addressed this issue by providing substantial coverage, its discontinuation has created a void in available data.

    BIP! NDR aims to alleviate this issue and enhance the research assessment processes within the field of Computer Science. To accomplish this, it leverages a workflow that identifies and retrieves Open Science papers lacking DOIs from the DBLP Corpus, and by performing text analysis, it extracts citation information directly from their full text. The current version of the dataset contains ~3.6M citations made by approximately 192.6K open access Computer Science conference or workshop papers that, according to DBLP, do not have a DOI. The DBLP snapshot used for this version was the one released in November 2024.

    File Structure:

    The dataset is formatted as a JSON Lines (JSONL) file (one JSON object per line) to facilitate file splitting and streaming; a parsing sketch follows the field list below.

    Each JSON object has three main fields:

    “_id”: a unique identifier,

    “citing_paper”, the “dblp_id” of the citing paper,

    “cited_papers”: array containing the objects that correspond to each reference found in the text of the “citing_paper”; each object may contain the following fields:

    “dblp_id”: the “dblp_id” of the cited paper. Optional - this field is required if a “doi” is not present.

    “doi”: the doi of the cited paper. Optional - this field is required if a “dblp_id” is not present.

    “bibliographic_reference”: the raw citation string as it appears in the citing paper.
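
    As mentioned above, a minimal parsing sketch; the file name "bip_ndr.jsonl" is a placeholder for the downloaded dump:

```python
import json

def iter_citations(path):
    """Yield (citing dblp_id, cited identifier) pairs from a BIP! NDR JSONL file."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            record = json.loads(line)
            for cited in record["cited_papers"]:
                # A cited paper carries a "doi", a "dblp_id", or both.
                yield record["citing_paper"], cited.get("doi") or cited.get("dblp_id")

pairs = list(iter_citations("bip_ndr.jsonl"))  # placeholder path
print(len(pairs), "citation links parsed")
```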

    Changes from previous version:

    Added more papers from DBLP.

  19. Data from: Invasive species - American bullfrog (Lithobates catesbeianus) in...

    • gbif.org
    • data.biodiversity.be
    • +4more
    Updated May 15, 2025
    + more versions
    Cite
    Sander Devisscher; Tim Adriaens; Gerald Louette; Dimitri Brosens; Peter Desmet; Sander Devisscher; Tim Adriaens; Gerald Louette; Dimitri Brosens; Peter Desmet (2025). Invasive species - American bullfrog (Lithobates catesbeianus) in Flanders, Belgium [Dataset]. http://doi.org/10.15468/2hqkqn
    Explore at:
    Dataset updated
    May 15, 2025
    Dataset provided by
    Global Biodiversity Information Facility: https://www.gbif.org/
    Research Institute for Nature and Forest (INBO)
    Authors
    Sander Devisscher; Tim Adriaens; Gerald Louette; Dimitri Brosens; Peter Desmet; Sander Devisscher; Tim Adriaens; Gerald Louette; Dimitri Brosens; Peter Desmet
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Time period covered
    Apr 27, 2010 - Dec 31, 2018
    Area covered
    Description

    Invasive species - American bullfrog (Lithobates catesbeianus) in Flanders, Belgium is a species occurrence dataset published by the Research Institute for Nature and Forest (INBO). The dataset contains over 7,500 occurrences (25% of which are American bullfrogs) sampled from 2010 until now, in the months April to October. The data are compiled from different sources at the INBO, but most of the occurrences were collected through fieldwork for the EU co-funded Interreg project INVEXO (http://www.invexo.eu). In this project, research was conducted on different methods for the management of American bullfrog populations, an alien invasive species in Belgium. Captured bullfrogs were almost always removed from the environment and humanely killed, while the other occurrences are recorded bycatch, which was released upon capture (see bibliography for detailed descriptions of the methods). Therefore, caution is advised when using these data for trend analysis, distribution range calculation, or other purposes. Issues with the dataset can be reported at https://github.com/inbo/data-publication/tree/master/datasets/invasive-bullfrog-occurrences

    We strongly believe an open attitude is essential for tackling the IAS problem (Groom et al. 2015). To allow anyone to use this dataset, we have released the data to the public domain under a Creative Commons Zero waiver (http://creativecommons.org/publicdomain/zero/1.0/). We would appreciate it however if you read and follow these norms for data use (http://www.inbo.be/en/norms-for-data-use) and provide a link to the original dataset (https://doi.org/10.15468/2hqkqn) whenever possible. If you use these data for a scientific paper, please cite the dataset following the applicable citation norms and/or consider us for co-authorship. We are always interested to know how you have used or visualized the data, or to provide more information, so please contact us via the contact information provided in the metadata, opendata@inbo.be or https://twitter.com/LifeWatchINBO.

  20. Data from: Standards Incorporated by Reference (SIBR) Database

    • s.cnmilf.com
    • datasets.ai
    • +2more
    Updated Sep 30, 2023
    Cite
    National Institute of Standards and Technology (2023). Standards Incorporated by Reference (SIBR) Database [Dataset]. https://s.cnmilf.com/user74170196/https/catalog.data.gov/dataset/standards-incorporated-by-reference-sibr-database
    Explore at:
    Dataset updated
    Sep 30, 2023
    Dataset provided by
    National Institute of Standards and Technology: http://www.nist.gov/
    Description

    This is a searchable historical collection of standards referenced in regulations - Voluntary consensus standards, government-unique standards, industry standards, and international standards referenced in the Code of Federal Regulations (CFR).
