100+ datasets found
  1. Data articles in journals

    • zenodo.org
    csv, txt, xls
    Updated May 30, 2025
    + more versions
    Cite
    Carlota Balsa-Sanchez; Vanesa Loureiro (2025). Data articles in journals [Dataset]. http://doi.org/10.5281/zenodo.15553313
    Explore at:
    Available download formats: txt, csv, xls
    Dataset updated
    May 30, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Carlota Balsa-Sanchez; Vanesa Loureiro
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    2025
    Description

    Version: 6

    Date of data collection: May 2025
    
    General description: Publication of datasets according to the FAIR principles can be achieved by publishing a data paper (and/or a software paper) in data journals as well as in standard academic journals. The Excel and CSV files contain a list of academic journals that publish data papers and software papers.
    
    File list:
    
    - data_articles_journal_list_v6.xlsx: full list of 177 academic journals in which data papers and/or software papers can be published
    - data_articles_journal_list_v6.csv: full list of 177 academic journals in which data papers and/or software papers can be published
    - readme_v6.txt: a detailed description of the dataset and its variables.
    
    Relationship between files: both files have the same information. Two different formats are offered to improve reuse
    
    Type of version of the dataset: final processed version
    
    Versions of the files: 6th version
    - Information updated: number of journals (17 added and 4 deleted), URL, document types associated with a specific journal.
    - Information added: diamond journals were identified.

    Version: 5

    Authors: Carlota Balsa-Sánchez, Vanesa Loureiro

    Date of data collection: 2023/09/05

    General description: The publication of datasets according to the FAIR principles can be achieved by publishing a data paper (or software paper) in data journals or in standard academic journals. The Excel and CSV files contain a list of academic journals that publish data papers and software papers.
    File list:

    - data_articles_journal_list_v5.xlsx: full list of 162 academic journals in which data papers and/or software papers can be published
    - data_articles_journal_list_v5.csv: full list of 162 academic journals in which data papers and/or software papers can be published

    Relationship between files: both files have the same information. Two different formats are offered to improve reuse

    Type of version of the dataset: final processed version

    Versions of the files: 5th version
    - Information updated: number of journals, URL, document types associated with a specific journal.
    163 journals (Excel and CSV)

    Version: 4

    Authors: Carlota Balsa-Sánchez, Vanesa Loureiro

    Date of data collection: 2022/12/15

    General description: The publication of datasets according to the FAIR principles can be achieved by publishing a data paper (or software paper) in data journals or in standard academic journals. The Excel and CSV files contain a list of academic journals that publish data papers and software papers.
    File list:

    - data_articles_journal_list_v4.xlsx: full list of 140 academic journals in which data papers and/or software papers can be published
    - data_articles_journal_list_v4.csv: full list of 140 academic journals in which data papers and/or software papers can be published

    Relationship between files: both files have the same information. Two different formats are offered to improve reuse

    Type of version of the dataset: final processed version

    Versions of the files: 4th version
    - Information updated: number of journals, URL, document types associated with a specific journal, publisher normalization and simplification of document types
    - Information added: listed in the Directory of Open Access Journals (DOAJ), indexed in Web of Science (WOS) and quartile in Journal Citation Reports (JCR) and/or Scimago Journal and Country Rank (SJR), Scopus and Web of Science (WOS), Journal Master List.

    Version: 3

    Authors: Carlota Balsa-Sánchez, Vanesa Loureiro

    Date of data collection: 2022/10/28

    General description: The publication of datasets according to the FAIR principles can be achieved by publishing a data paper (or software paper) in data journals or in standard academic journals. The Excel and CSV files contain a list of academic journals that publish data papers and software papers.
    File list:

    - data_articles_journal_list_v3.xlsx: full list of 124 academic journals in which data papers and/or software papers can be published
    - data_articles_journal_list_3.csv: full list of 124 academic journals in which data papers and/or software papers can be published

    Relationship between files: both files have the same information. Two different formats are offered to improve reuse

    Type of version of the dataset: final processed version

    Versions of the files: 3rd version
    - Information updated: number of journals, URL, document types associated with a specific journal, publisher normalization and simplification of document types
    - Information added: listed in the Directory of Open Access Journals (DOAJ), indexed in Web of Science (WOS) and quartile in Journal Citation Reports (JCR) and/or Scimago Journal and Country Rank (SJR).

    Erratum - Data articles in journals Version 3:

    Botanical Studies -- ISSN 1999-3110 -- JCR (JIF) Q2
    Data -- ISSN 2306-5729 -- JCR (JIF) n/a
    Data in Brief -- ISSN 2352-3409 -- JCR (JIF) n/a

    Version: 2

    Author: Francisco Rubio, Universitat Politècnica de València.

    Date of data collection: 2020/06/23

    General description: The publication of datasets according to the FAIR principles can be achieved by publishing a data paper (or software paper) in data journals or in standard academic journals. The Excel and CSV files contain a list of academic journals that publish data papers and software papers.
    File list:

    - data_articles_journal_list_v2.xlsx: full list of 56 academic journals in which data papers and/or software papers can be published
    - data_articles_journal_list_v2.csv: full list of 56 academic journals in which data papers and/or software papers can be published

    Relationship between files: both files have the same information. Two different formats are offered to improve reuse

    Type of version of the dataset: final processed version

    Versions of the files: 2nd version
    - Information updated: number of journals, URL, document types associated with a specific journal, publisher normalization and simplification of document types
    - Information added: listed in the Directory of Open Access Journals (DOAJ), indexed in Web of Science (WOS) and quartile in Scimago Journal and Country Rank (SJR)

    Total size: 32 KB

    Version 1: Description

    This dataset contains a list of journals that publish data articles, code, software articles and database articles.

    The search strategy in DOAJ and Ulrichsweb was to search for the word "data" in journal titles.
    Acknowledgements:
    Xaquín Lores Torres for his invaluable help in preparing this dataset.

  2. Dataset 1: Studies included in literature review

    • catalog.data.gov
    • data.amerigeoss.org
    Updated Nov 12, 2020
    Cite
    U.S. EPA Office of Research and Development (ORD) (2020). Dataset 1: Studies included in literature review [Dataset]. https://catalog.data.gov/dataset/dataset-1-studies-included-in-literature-review
    Explore at:
    Dataset updated
    Nov 12, 2020
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    This dataset contains the results of a literature review of experimental nutrient addition studies to determine which nutrient forms were most often measured in the scientific literature. To obtain a representative selection of relevant studies, we searched Web of Science™ using a search string to target experimental studies in artificial and natural lotic systems while limiting irrelevant papers. We screened the titles and abstracts of returned papers for relevance (experimental studies in streams/stream mesocosms that manipulated nutrients). To supplement this search, we sorted the relevant articles from the Web of Science™ search alphabetically by author and sequentially examined the bibliographies for additional relevant articles (screening titles for relevance, and then screening abstracts of potentially relevant articles) until we had obtained a total of 100 articles. If we could not find a relevant article electronically, we moved to the next article in the bibliography. Our goal was not to be completely comprehensive, but to obtain a fairly large sample of published, peer-reviewed studies from which to assess patterns.

    We excluded any lentic or estuarine studies from consideration and included only studies that used mesocosms mimicking stream systems (flowing water or stream water source) or that manipulated nutrient concentrations in natural streams or rivers. We excluded studies that used nutrient diffusing substrate (NDS) because these manipulate nutrients on substrates and not in the water column. We also excluded studies examining only nutrient uptake, which rely on measuring dissolved nutrient concentrations with the goal of characterizing in-stream processing (e.g., Newbold et al., 1983). From the included studies, we extracted or summarized the following information: study type, study duration, nutrient treatments, nutrients measured, inclusion of TN and/or TP response to nutrient additions, and a description of how results were reported in relation to the research-management mismatch, if it existed.

    Below is information on how the search was conducted. Search string used for Web of Science advanced search (conducted on 27 September 2016):

    TS= (stream OR creek OR river* OR lotic OR brook OR headwater OR tributary) AND TS = (mesocosm OR flume OR "artificial stream" OR "experimental stream" OR "nutrient addition") AND TI= (nitrogen OR phosphorus OR nutrient OR enrichment OR fertilization OR eutrophication)

  3. PLOS Open Science Indicators

    • plos.figshare.com
    zip
    Updated Jul 10, 2025
    Cite
    Public Library of Science (2025). PLOS Open Science Indicators [Dataset]. http://doi.org/10.6084/m9.figshare.21687686.v10
    Explore at:
    Available download formats: zip
    Dataset updated
    Jul 10, 2025
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Public Library of Science
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains article metadata and information about Open Science Indicators for approximately 139,000 research articles published in PLOS journals from 1 January 2018 to 30 March 2025, and a set of approximately 28,000 comparator articles published in non-PLOS journals. This is the tenth release of this dataset, which will be updated with new versions on an annual basis. This version of the Open Science Indicators dataset shares the indicators seen in the previous versions as well as fully operationalised protocols and study registration indicators, which were previously only shared in preliminary forms.

    The v10 dataset focuses on detection of five Open Science practices by analysing the XML of published research articles:
    - Sharing of research data, in particular data shared in data repositories
    - Sharing of code
    - Posting of preprints
    - Sharing of protocols
    - Sharing of study registrations

    The dataset provides data and code generation and sharing rates, and the location of shared data and code (whether in Supporting Information or in an online repository). It also provides preprint, protocol and study registration sharing rates, as well as details of the shared output, such as publication date, URL/DOI/Registration Identifier and platform used. Additional data fields are also provided for each article analysed. This release has been run using an updated preprint detection method (see OSI-Methods-Statement_v10_Jul25.pdf for details). Further information on the methods used to collect and analyse the data can be found in the Documentation folder. Further information on the principles and requirements for developing Open Science Indicators is available at https://doi.org/10.6084/m9.figshare.21640889.

    Data folders/files

    Data Files folder: contains the main OSI dataset files, PLOS-Dataset_v10_Jul25.csv and Comparator-Dataset_v10_Jul25.csv, which contain descriptive metadata (e.g. article title, publication date, author countries, taken from the article .xml files) and additional information around the Open Science Indicators derived algorithmically. The OSI-Summary-statistics_v10_Jul25.xlsx file contains the summary data for both PLOS-Dataset_v10_Jul25.csv and Comparator-Dataset_v10_Jul25.csv.

    Documentation folder: contains documentation related to the main data files. The file OSI-Methods-Statement_v10_Jul25.pdf describes the methods underlying the data collection and analysis. OSI-Column-Descriptions_v10_Jul25.pdf describes the fields used in PLOS-Dataset_v10_Jul25.csv and Comparator-Dataset_v10_Jul25.csv. OSI-Repository-List_v1_Dec22.xlsx lists the repositories and their characteristics used to identify specific repositories in the PLOS-Dataset_v10_Jul25.csv and Comparator-Dataset_v10_Jul25.csv repository fields. The folder also contains documentation originally shared alongside the preliminary versions of the protocols and study registration indicators, in order to give fuller details of their detection methods.

    Contact details for further information:
    Iain Hrynaszkiewicz, Director, Open Research Solutions, PLOS, ihrynaszkiewicz@plos.org / plos@plos.org
    Lauren Cadwallader, Open Research Manager, PLOS, lcadwallader@plos.org / plos@plos.org

    Acknowledgements: thanks to Allegra Pearce, Tim Vines, Asura Enkhbayar, Scott Kerr and Parth Sarin of DataSeer for contributing to data acquisition and supporting information.
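    As an illustration only (not part of the dataset), a minimal Python/pandas sketch for loading the main file: the file name is taken from the description above, while the indicator column name used here is an assumption and should be checked against OSI-Column-Descriptions_v10_Jul25.pdf.

    ```python
    import pandas as pd

    # Load the main OSI file (file name from the description above; the indicator
    # column name below is an assumption, see OSI-Column-Descriptions_v10_Jul25.pdf).
    plos = pd.read_csv("PLOS-Dataset_v10_Jul25.csv", low_memory=False)
    print(f"{len(plos)} PLOS articles loaded")

    # Hypothetical indicator column: share of articles flagged as sharing data in a repository.
    if "data_shared_in_repository" in plos.columns:
        rate = plos["data_shared_in_repository"].fillna(False).mean()
        print(f"Data shared in a repository: {rate:.1%}")
    ```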

  4. Open Science for Social Sciences and Humanities: Open Access availability...

    • data.niaid.nih.gov
    Updated Aug 18, 2023
    Cite
    Sebastiano Giacomini (2023). Open Science for Social Sciences and Humanities: Open Access availability and distribution across disciplines and Countries in OpenCitations Meta - RESULTS DATASET (with Mega Journals) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8250857
    Explore at:
    Dataset updated
    Aug 18, 2023
    Dataset provided by
    Seyedali Ghasempouri
    Maddalena Ghiotto
    Sebastiano Giacomini
    License

    Attribution 1.0 (CC BY 1.0), https://creativecommons.org/licenses/by/1.0/
    License information was derived automatically

    Description

    The dataset contains all the data produced by running the research software for the study: "Open Science for Social Sciences and Humanities: Open Access availability and distribution across disciplines and Countries in OpenCitations Meta".

    Disclaimer: these results are not considered to be representative, because we have found that Mega Journals significantly skewed some of the data. The result datasets without Mega Journals are published here.

    Description of datasets:

    SSH_Publications_in_OC_Meta_and_Open_Access_status.csv: contains information about OpenCitations Meta coverage of ERIH PLUS journals as well as their Open Access availability. In this dataset, every row holds data for a journal of ERIH PLUS that is also covered by the OpenCitations Meta database. It is structured with the following columns: "EP_id", the internal ERIH PLUS identifier; "Publications_in_venue", the number of publications counted in each venue; "OC_omid", the internal OpenCitations Meta identifier for the venue; "issn", the ISSN values associated with the venue; "Open Access", a value representing whether the journal is OA or not, either "True" or "Unknown".

    SSH_Publications_by_Discipline.csv: contains information about the number of publications per discipline (the number of journals per discipline is also included). The dataset has three columns: the first, labeled "Discipline", contains single disciplines of the ERIH classification; the second and the third, labeled "Journal_count" and "Publication_count", respectively, contain the number of journals and the number of publications counted for each discipline.

    SSH_Publications_and_Journals_by_Country: contains information about the number of publications and journals per country. The dataset has three columns: the first, labeled "Country", contains single countries of the ERIH classification; the second and the third, labeled "Journal_count" and "Publication_count", respectively, contain the number of journals and the number of publications counted for each country.

    result_disciplines.json: the dictionary containing all disciplines as keys and a list of related ERIH PLUS venue identifiers as values.

    result_countries.json: the dictionary containing all countries as keys and a list of related ERIH PLUS venue identifiers as values.

    duplicate_omids.csv: a dataset containing the duplicated journal entries in OpenCitations Meta, structured with two columns: "OC_omid", the internal OC Meta identifier; "issn", the ISSN values associated with that identifier.

    eu_data.csv: contains the data specific to European countries' SSH journals covered in OC Meta. It is structured with the following columns: "EP_id", the internal ERIH PLUS identifier; "Publications_in_venue", the number of publications counted in each venue; "Original_Title"; "Country_of_Publication"; "ERIH_PLUS_Disciplines"; "disc_count", the number of disciplines per journal.

    eu_disciplines_count.csv: contains information about the number of publications per discipline and the number of journals per discipline for European countries. The dataset has three columns: the first, labeled "Discipline", contains single disciplines of the ERIH classification; the second and the third, labeled "Journal_count" and "Publication_count", respectively, contain the number of journals and the number of publications counted for each discipline.

    meta_coverage_eu.csv: contains the data specific to European countries' SSH journals covered in OC Meta. It is structured with the following columns: "EP_id", the internal ERIH PLUS identifier; "Publications_in_venue", the number of publications counted in each venue; "OC_omid", the internal OpenCitations Meta identifier for the venue; "issn", the ISSN values associated with the venue; "Open Access", a value representing whether the journal is OA or not, either "True" or "Unknown".

    us_data.csv: contains the data specific to the United States' SSH journals covered in OC Meta. It is structured with the following columns: "EP_id", the internal ERIH PLUS identifier; "Publications_in_venue", the number of publications counted in each venue; "Original_Title"; "Country_of_Publication"; "ERIH_PLUS_Disciplines"; "disc_count", the number of disciplines per journal.

    us_disciplines_count.csv: contains information about the number of publications per discipline and the number of journals per discipline for the United States. The dataset has three columns: the first, labeled "Discipline", contains single disciplines of the ERIH classification; the second and the third, labeled "Journal_count" and "Publication_count", respectively, contain the number of journals and the number of publications counted for each discipline.

    meta_coverage_us.csv: contains the data specific to the United States' SSH journals covered in OC Meta. It is structured with the following columns: "EP_id", the internal ERIH PLUS identifier; "Publications_in_venue", the number of publications counted in each venue; "OC_omid", the internal OpenCitations Meta identifier for the venue; "issn", the ISSN values associated with the venue; "Open Access", a value representing whether the journal is OA or not, either "True" or "Unknown".
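    Illustration only: a small pandas sketch that loads SSH_Publications_in_OC_Meta_and_Open_Access_status.csv and computes the share of Open Access journals and publications, using the column names quoted in the descriptions above.

    ```python
    import pandas as pd

    # Column names are quoted from the file description above.
    df = pd.read_csv("SSH_Publications_in_OC_Meta_and_Open_Access_status.csv")

    oa = df["Open Access"].astype(str).eq("True")
    print(f"OA journals: {oa.mean():.1%} of {len(df)} journals covered in OC Meta")

    # Publications in OA venues vs. all publications counted in the dataset.
    oa_pubs = df.loc[oa, "Publications_in_venue"].sum()
    print(f"Publications in OA venues: {oa_pubs} of {df['Publications_in_venue'].sum()}")
    ```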

    Abstract of the research:

    Purpose: this study aims to investigate the representation and distribution of Social Science and Humanities (SSH) journals within the OpenCitations Meta database, with a particular emphasis on their Open Access (OA) status, as well as their spread across different disciplines and countries. The underlying premise is that open infrastructures play a pivotal role in promoting transparency, reproducibility, and trust in scientific research. Study Design and Methodology: the study is grounded on the premise that open infrastructures are crucial for ensuring transparency, reproducibility, and fostering trust in scientific research. The research methodology involved the use of secondary data sources, namely the OpenCitations Meta database, the ERIH PLUS bibliographic index, and the DOAJ index. A custom research software was developed in Python to facilitate the processing and analysis of the data. Findings: the results reveal that 78.1% of SSH journals listed in the European Reference Index for the Humanities (ERIH-PLUS) are included in the OpenCitations Meta database. The discipline of Psychology has the highest number of publications. The United States and the United Kingdom are the leading contributors in terms of the number of publications. However, the study also uncovers that only 38% of the SSH journals in the OpenCitations Meta database are OA. Originality: this research adds to the existing body of knowledge by providing insights into the representation of SSH in open bibliographic databases and the role of open access in this domain. The study highlights the necessity for advocating OA practices within SSH and the significance of open data for bibliometric studies. It further encourages additional research into the impact of OA on various facets of citation patterns and the factors leading to disparity across disciplinary representation.

    Related resources:

    Ghasempouri S., Ghiotto M., & Giacomini S. (2023). Open Science for Social Sciences and Humanities: Open Access availability and distribution across disciplines and Countries in OpenCitations Meta - RESEARCH ARTICLE. https://doi.org/10.5281/zenodo.8263908

    Ghasempouri, S., Ghiotto, M., Giacomini, S., (2023). Open Science for Social Sciences and Humanities: Open Access availability and distribution across disciplines and Countries in OpenCitations Meta - DATA MANAGEMENT PLAN (Version 4). Zenodo. https://doi.org/10.5281/zenodo.8174644

    Ghasempouri, S., Ghiotto, M., Giacomini, S. (2023e). Open Science for Social Sciences and Humanities: Open Access availability and distribution across disciplines and Countries in OpenCitations Meta - PROTOCOL. V.5. (https://dx.doi.org/10.17504/protocols.io.5jyl8jo1rg2w/v5)

  5. Analysis of CBCS publications for Open Access, data availability statements...

    • figshare.scilifelab.se
    • researchdata.se
    txt
    Updated Jan 15, 2025
    Cite
    Theresa Kieselbach (2025). Analysis of CBCS publications for Open Access, data availability statements and persistent identifiers for supplementary data [Dataset]. http://doi.org/10.17044/scilifelab.23641749.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Jan 15, 2025
    Dataset provided by
    Umeå University
    Authors
    Theresa Kieselbach
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    General description

    This dataset contains some markers of Open Science in the publications of the Chemical Biology Consortium Sweden (CBCS) between 2010 and July 2023. The sample of CBCS publications during this period consists of 188 articles. Every publication was visited manually at its DOI URL to answer the following questions:
    1. Is the research article an Open Access publication?
    2. Does the research article have a Creative Commons license or a similar license?
    3. Does the research article contain a data availability statement?
    4. Did the authors submit data of their study to a repository such as EMBL, GenBank, Protein Data Bank (PDB), Cambridge Crystallographic Data Centre (CCDC), Dryad or a similar repository?
    5. Does the research article contain supplementary data?
    6. Do the supplementary data have a persistent identifier that makes them citable as a defined research output?

    Variables

    The data were compiled in a Microsoft Excel 365 document that includes the following variables:
    1. DOI URL of research article
    2. Year of publication
    3. Research article published with Open Access
    4. License for research article
    5. Data availability statement in article
    6. Supplementary data added to article
    7. Persistent identifier for supplementary data
    8. Authors submitted data to NCBI or EMBL or PDB or Dryad or CCDC

    Visualization

    Parts of the data were visualized in two figures as bar diagrams using Microsoft Excel 365. The first figure displays the number of publications per year, the number of publications published with open access, and the number of publications that contain a data availability statement (Figure 1). The second figure shows the number of publications per year and how many publications contain supplementary data; it also shows how many of the supplementary datasets have a persistent identifier (Figure 2).

    File formats and software

    The file formats used in this dataset are: .csv (text file), .docx (Microsoft Word 365 file), .jpg (JPEG image file), .pdf/A (Portable Document Format for archiving), .png (Portable Network Graphics image file), .pptx (Microsoft PowerPoint 365 file), .txt (text file), .xlsx (Microsoft Excel 365 file). All files can be opened with Microsoft Office 365 and likely also work with the older versions Office 2019 and 2016.

    MD5 checksums

    Here is a list of all files of this dataset and of their MD5 checksums:
    1. Readme.txt (MD5: 795f171be340c13d78ba8608dafb3e76)
    2. Manifest.txt (MD5: 46787888019a87bb9d897effdf719b71)
    3. Materials_and_methods.docx (MD5: 0eedaebf5c88982896bd1e0fe57849c2)
    4. Materials_and_methods.pdf (MD5: d314bf2bdff866f827741d7a746f063b)
    5. Materials_and_methods.txt (MD5: 26e7319de89285fc5c1a503d0b01d08a)
    6. CBCS_publications_until_date_2023_07_05.xlsx (MD5: 532fec0bd177844ac0410b98de13ca7c)
    7. CBCS_publications_until_date_2023_07_05.csv (MD5: 2580410623f79959c488fdfefe8b4c7b)
    8. Data_from_CBCS_publications_until_date_2023_07_05_obtained_by_manual_collection.xlsx (MD5: 9c67dd84a6b56a45e1f50a28419930e5)
    9. Data_from_CBCS_publications_until_date_2023_07_05_obtained_by_manual_collection.csv (MD5: fb3ac69476bfc57a8adc734b4d48ea2b)
    10. Aggregated_data_from_CBCS_publications_until_2023_07_05.xlsx (MD5: 6b6cbf3b9617fa8960ff15834869f793)
    11. Aggregated_data_from_CBCS_publications_until_2023_07_05.csv (MD5: b2b8dd36ba86629ed455ae5ad2489d6e)
    12. Figure_1_CBCS_publications_until_2023_07_05_Open_Access_and_data_availablitiy_statement.xlsx (MD5: 9c0422cf1bbd63ac0709324cb128410e)
    13. Figure_1.pptx (MD5: 55a1d12b2a9a81dca4bb7f333002f7fe)
    14. Image_of_figure_1.jpg (MD5: 5179f69297fbbf2eaaf7b641784617d7)
    15. Image_of_figure_1.png (MD5: 8ec94efc07417d69115200529b359698)
    16. Figure_2_CBCS_publications_until_2023_07_05_supplementary_data_and_PID_for_supplementary_data.xlsx (MD5: f5f0d6e4218e390169c7409870227a0a)
    17. Figure_2.pptx (MD5: 0fd4c622dc0474549df88cf37d0e9d72)
    18. Image_of_figure_2.jpg (MD5: c6c68b63b7320597b239316a1c15e00d)
    19. Image_of_figure_2.png (MD5: 24413cc7d292f468bec0ac60cbaa7809)
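    Illustration only: a short Python sketch for verifying a few of the MD5 checksums listed above with hashlib; file names and expected digests are copied from the list, and the files are assumed to sit in the working directory.

    ```python
    import hashlib
    from pathlib import Path

    # File names and expected MD5 values are taken from the checksum list above (excerpt).
    EXPECTED = {
        "Readme.txt": "795f171be340c13d78ba8608dafb3e76",
        "Manifest.txt": "46787888019a87bb9d897effdf719b71",
        "CBCS_publications_until_date_2023_07_05.csv": "2580410623f79959c488fdfefe8b4c7b",
    }

    def md5sum(path: Path, chunk_size: int = 1 << 20) -> str:
        """Return the MD5 hex digest of a file, read in chunks."""
        digest = hashlib.md5()
        with path.open("rb") as fh:
            for chunk in iter(lambda: fh.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()

    for name, expected in EXPECTED.items():
        path = Path(name)
        status = "missing" if not path.exists() else ("OK" if md5sum(path) == expected else "MISMATCH")
        print(f"{name}: {status}")
    ```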

  6. An open dataset of scholarly publications highlighted by journal editors

    • zenodo.org
    bin
    Updated Nov 17, 2020
    Cite
    Alexis-Michel Mugabushaka; Jasmin Sadat; Faria (2020). An open dataset of scholarly publications highlighted by journal editors [Dataset]. http://doi.org/10.5281/zenodo.4275660
    Explore at:
    Available download formats: bin
    Dataset updated
    Nov 17, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Alexis-Michel Mugabushaka; Jasmin Sadat; Faria
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Abstract:

    We present a dataset of scholarly publications featured as outstanding by journal editors.
    This first version, which covers the last 10 years, includes papers referenced in (a) Breakthroughs of the year by Science magazine, (b) Les 10 découvertes de l'année by La Recherche magazine, (c) research highlights by the journal Nature, and (d) Editors' choice by Science magazine.

    The rationale and process of its creation are described in detail in the following pre-print:

    Mugabushaka, A.M., Sadat, J. and Dantas Faria, J.C. (2020). In Search of Outstanding Research Advances: prototyping the creation of an open dataset of "editorial highlights". arXiv:2011.07910


    Description of the dataset

    In this version, the datasets are released in four Microsoft Excel files:

    1. Science - Breakthroughs of the year

    2. La Recherche - les dix découvertes de l'année

    3. Nature Magazine - Research highlights

    4. Science magazine - Editors' choices

    The entries for each year are recorded in a separate sheet.

    In each sheet, the highlighting article and its metadata are recorded, as well as the referenced papers together with their identifiers (DOIs).
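    Illustration only: since each workbook stores one sheet per year, a minimal pandas sketch that reads every sheet at once; the file name used here is a placeholder, not the actual name of the Excel files in the dataset.

    ```python
    import pandas as pd

    # File name is illustrative; substitute the actual name of one of the four Excel files.
    sheets = pd.read_excel("science_breakthroughs_of_the_year.xlsx", sheet_name=None)

    # `sheets` maps sheet name (one per year) to a DataFrame holding the highlighting
    # article, its metadata, and the referenced papers with their DOIs.
    for year, frame in sheets.items():
        print(year, len(frame), "rows")
    ```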

  7. Citation Knowledge with Section and Context

    • ordo.open.ac.uk
    zip
    Updated May 5, 2020
    Cite
    Anita Khadka (2020). Citation Knowledge with Section and Context [Dataset]. http://doi.org/10.21954/ou.rd.11346848.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    May 5, 2020
    Dataset provided by
    The Open University
    Authors
    Anita Khadka
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    This dataset contains information from scientific publications written by authors who have published papers at the RecSys conference. It contains four files with information extracted from scientific publications. The details of each file are explained below:

    i) all_authors.tsv: This file contains the details of authors who published research papers at the RecSys conference. The details include authors' identifiers in various forms (such as number, ORCID id, DBLP URL, DBLP key and Google Scholar URL), authors' first name, last name and their affiliation (where they work).

    ii) all_publications.tsv: This file contains the details of publications authored by the authors listed in the all_authors.tsv file (please note the list of publications does not contain all the authored publications of the authors; refer to the publication for further details). The details include publications' identifiers in different forms (such as number, DBLP key, DBLP URL, Google Scholar URL), title, filtered title, published date, published conference and paper abstract.

    iii) selected_author_publications-information.tsv: This file consists of identifiers of authors and their publications. Here, we provide the information of the selected authors and their publications used for our experiment.

    iv) selected_publication_citations-information.tsv: This file contains the information of the selected publications, consisting of both citing and cited papers' information used in our experiment. It consists of the identifier of the citing paper, the identifier of the cited paper, citation title, citation filtered title, the sentence before the citation is mentioned, the citing sentence, the sentence after the citation is mentioned, and the citation position (section). Please note, it does not contain information on all the citations cited in the publications. For more detail, please refer to the paper.

    This dataset is for research purposes only. If you use this dataset, please cite our paper "Capturing and exploiting citation knowledge for recommending recently published papers", due to be published in the Web2Touch track 2020 (not yet published).
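    Illustration only: a pandas sketch for loading the four TSV files described above; the exact identifier column names are not given in the description, so the column name used in the final example is an assumption.

    ```python
    import pandas as pd

    # The four files named in the description above, tab-separated.
    authors = pd.read_csv("all_authors.tsv", sep="\t")
    publications = pd.read_csv("all_publications.tsv", sep="\t")
    author_pubs = pd.read_csv("selected_author_publications-information.tsv", sep="\t")
    citations = pd.read_csv("selected_publication_citations-information.tsv", sep="\t")

    # Inspect the actual identifier column names before joining anything.
    print(authors.columns.tolist())
    print(publications.columns.tolist())

    # Example: count citation contexts per citing paper. The column name
    # "citing_paper_id" is an assumption; replace it with the real identifier column.
    if "citing_paper_id" in citations.columns:
        print(citations["citing_paper_id"].value_counts().head())
    ```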

  8. Survey data of "Mapping Research Output to the Sustainable Development Goals...

    • zenodo.org
    • explore.openaire.eu
    bin, pdf, zip
    Updated Jul 22, 2024
    + more versions
    Cite
    Maurice Vanderfeesten; Eike Spielberg; Yassin Gunes (2024). Survey data of "Mapping Research Output to the Sustainable Development Goals (SDGs)" [Dataset]. http://doi.org/10.5281/zenodo.3813230
    Explore at:
    Available download formats: bin, zip, pdf
    Dataset updated
    Jul 22, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Maurice Vanderfeesten; Eike Spielberg; Yassin Gunes
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains information on what papers and concepts researchers find relevant to map domain specific research output to the 17 Sustainable Development Goals (SDGs).

    Sustainable Development Goals are the 17 global challenges set by the United Nations. Within each of the goals, specific targets and indicators are defined to monitor progress towards reaching those goals by 2030. In an effort to capture how research is contributing to move the needle on those challenges, we earlier built an initial classification model that enables quick identification of which research output is related to which SDG. (This Aurora SDG dashboard is the initial outcome as proof of practice.)

    In order to validate our current classification model (on soundness/precision and completeness/recall), and receive input for improvement, a survey has been conducted to capture expert knowledge from senior researchers in their research domain related to the SDG. The survey was open to the world, but mainly distributed to researchers from the Aurora Universities Network. The survey was open from October 2019 till January 2020, and captured data from 244 respondents in Europe and North America.

    17 surveys were created from a single template, with the content made specific for each SDG. Content for each survey, such as a random set of publications, was ingested from a data provisioning server that had collected research output metadata for each SDG in an earlier stage. It took a respondent on average 1 hour to complete the survey. The outcome of the survey data can be used for validating current and optimizing future SDG classification models for mapping research output to the SDGs.

    The survey contains the following questions (see inside dataset for exact wording):

    • Are you familiar with this SDG?
      • Respondents could only proceed if they were familiar with the targets and indicators of this SDG. The goal of this question was to weed out unknowledgeable respondents and to increase the quality of the survey data.
    • Suggest research papers that are relevant for this SDG (upload list)
      • This question, to provide a list, was put first to reduce influence from the other questions. The goal of this question was to measure the completeness/recall of the papers in the result set of our current classification model. (To lower the bar, these lists could be provided either by uploading a file from a reference manager (preferred) in .ris or BibTeX format, or as a list of titles. This heterogeneous input was processed further on by hand into a uniform format.)
    • Select research papers that are relevant for this SDG (radio buttons: accept, reject)
      • A randomly selected set of 100 papers was injected into the survey, out of the full list of thousands of papers in the result set of our current classification model. The goal of this question was to measure the soundness/precision of our current classification model.
    • Select and Suggest Keywords related to SDG (checkboxes: accept | text field: suggestions)
      • The survey was injected with the top 100 most frequent keywords that appeared in the metadata of the papers in the result set of the current classification model. Respondents could select relevant keywords we found, and add new ones in a blank text field. The goal of this question was to get suggestions for keywords we can use to increase the recall of relevant papers in a new classification model.
    • Suggest SDG related glossaries with relevant keywords (text fields: url)
      • Open text field to add URLs to lists with hundreds of relevant keywords related to this SDG. The goal of this question was to get suggestions for keywords we can use to increase the recall of relevant papers in a new classification model.
    • Select and Suggest Journals fully related to SDG (checkboxes: accept | text field: suggestions)
      • The survey was injected with the top 100 most frequent journals that appeared in the metadata of the papers in the result set of the current classification model. Respondents could select relevant journals we found, and add new ones in a blank text field. The goal of this question was to get suggestions for complete journals we can use to increase the recall of relevant papers in a new classification model.
    • Suggest improvements for the current queries (text field: suggestions per target)
      • We showed respondents the queries we used in our current classification model next to each of the targets within the goal. Open text fields were presented to change, add, re-order, or delete something (keywords, boolean operators, etc.) in the query to improve it in their opinion. The goal of this question was to get suggestions we can use to increase the recall and precision of relevant papers in a new classification model.

    In the dataset root you'll find the following folders and files:

    • /00-survey-input/
      • This contains the survey questions for all the individual SDGs. It also contains lists of EIDs categorised to the SDGs we used to make randomized selections from to present to the respondents.
    • /01-raw-data/
      • This contains the raw survey output. (Excluding privacy sensitive information for public release.) This data needs to be combined with the data on the provisioning server to make sense.
    • /02-aggregated-data/
      • Here individual responses are aggregated. The survey data is also combined with data from the provisioning server; responses of all SDG surveys are combined, aggregated, and split per question type.
    • /03-scripts/
      • This contains scripts to split data, and to add descriptive metadata for text analysis in a later stage.
    • /04-processed-data/
      • This is the main final result that can be used for further analysis. Data is split by SDG into subdirectories; in each you'll find files per question type containing the aggregated data of the respondents.
    • /images/
      • images of the results used in this README.md.
    • LICENSE.md
      • terms and conditions for reusing this data.
    • README.md
      • description of the dataset; each subfolder contains a README.md file to further describe the content of each sub-folder.

    In /04-processed-data/ you'll find the following files in each SDG sub-folder:

    • SDG-survey-questions.pdf
      • This file contains the survey questions
    • SDG-survey-questions.doc
      • This file contains the survey questions
    • SDG-survey-respondents-per-sdg.csv
      • Basic information about the survey and responses
    • SDG-survey-city-heatmap.csv
      • Origin of the respondents per SDG survey
    • SDG-survey-suggested-publications.txt
      • Formatted list of research papers researchers have uploaded or listed that they want to see back in the result set for this SDG.
    • SDG-survey-suggested-publications-with-eid-match.csv
      • Same as above, only matched with an EID. EIDs are matched by Elsevier's internal fuzzy matching algorithm. Only papers with high confidence are shown with a matched EID, referring to a record in Scopus.
    • SDG-survey-selected-publications-accepted.csv
      • Based on our previous result set of papers, researchers were presented random samples and selected papers they believe represent this SDG. (TRUE=accepted) A small aggregation sketch using this file and the rejected-papers file follows this list.
    • SDG-survey-selected-publications-rejected.csv
      • Based on our previous result set of papers, researchers were presented random samples and selected papers they believe do not represent this SDG. (FALSE=rejected)
    • SDG-survey-selected-keywords.csv
      • Based on our previous result set of papers, we presented researchers the keywords that are in the metadata of those papers; they selected keywords they believe represent this SDG.
    • SDG-survey-unselected-keywords.csv
      • As "selected-keywords", this is the list of keywords that respondents have not selected to represent this SDG.
    • SDG-survey-suggested-keywords.csv
      • List of keywords researchers suggest to use to find papers related to this SDG
    • SDG-survey-glossaries.csv
      • List of glossaries, containing keywords, researchers suggest to use to find papers related to this SDG
    • SDG-survey-selected-journals.csv
      • Based on our previous result set of papers, we presented researchers the journals that are in the metadata of those papers; they selected journals they believe represent this SDG.
    • SDG-survey-unselected-journals.csv
      • As "selected-journals", this is the list of journals that respondents have not selected to represent this SDG.
      
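    Illustration only, referenced in the accepted-papers item above: a Python sketch that aggregates the accepted and rejected selections into a rough per-SDG precision estimate for the classification model. The per-SDG subdirectory layout under /04-processed-data/ and the one-row-per-judged-paper layout of the CSVs are assumptions.

    ```python
    import pandas as pd
    from pathlib import Path

    # Rough precision proxy per SDG: accepted / (accepted + rejected) selections.
    # The directory layout ("04-processed-data/<SDG sub-folder>/") and the idea that
    # each CSV row is one judged paper are assumptions about the processed files.
    base = Path("04-processed-data")

    for sdg_dir in sorted(base.iterdir()):
        accepted_file = sdg_dir / "SDG-survey-selected-publications-accepted.csv"
        rejected_file = sdg_dir / "SDG-survey-selected-publications-rejected.csv"
        if not (accepted_file.exists() and rejected_file.exists()):
            continue
        n_accept = len(pd.read_csv(accepted_file))
        n_reject = len(pd.read_csv(rejected_file))
        total = n_accept + n_reject
        if total:
            print(f"{sdg_dir.name}: {n_accept}/{total} accepted ({n_accept / total:.0%})")
    ```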
  9. Data from: OpCitance: Citation contexts identified from the PubMed Central...

    • databank.illinois.edu
    Updated Feb 15, 2023
    Cite
    Tzu-Kun Hsiao; Vetle Torvik (2023). OpCitance: Citation contexts identified from the PubMed Central open access articles [Dataset]. http://doi.org/10.13012/B2IDB-4353270_V1
    Explore at:
    Dataset updated
    Feb 15, 2023
    Authors
    Tzu-Kun Hsiao; Vetle Torvik
    Dataset funded by
    U.S. National Institutes of Health (NIH)
    Description

    Sentences and citation contexts identified from the PubMed Central open access articles
    ----------------------------------------------------------------------
    The dataset is delivered as 24 tab-delimited text files. The files contain 720,649,608 sentences, 75,848,689 of which are citation contexts. The dataset is based on a snapshot of articles in the XML version of the PubMed Central open access subset (i.e., the PMCOA subset). The PMCOA subset was collected in May 2019. The dataset is created as described in: Hsiao TK., & Torvik V. I. (manuscript) OpCitance: Citation contexts identified from the PubMed Central open access articles.

    Files:
    • A_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with A.
    • B_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with B.
    • C_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with C.
    • D_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with D.
    • E_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with E.
    • F_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with F.
    • G_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with G.
    • H_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with H.
    • I_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with I.
    • J_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with J.
    • K_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with K.
    • L_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with L.
    • M_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with M.
    • N_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with N.
    • O_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with O.
    • P_p1_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with P (part 1).
    • P_p2_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with P (part 2).
    • Q_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with Q.
    • R_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with R.
    • S_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with S.
    • T_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with T.
    • UV_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with U or V.
    • W_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with W.
    • XYZ_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with X, Y or Z.

    Each row in the file is a sentence/citation context and contains the following columns:
    • pmcid: PMCID of the article.
    • pmid: PMID of the article. If an article does not have a PMID, the value is NONE.
    • location: The article component (abstract, main text, table, figure, etc.) to which the citation context/sentence belongs.
    • IMRaD: The type of IMRaD section associated with the citation context/sentence. I, M, R, and D represent introduction/background, method, results, and conclusion/discussion, respectively; NoIMRaD indicates that the section type is not identifiable.
    • sentence_id: The ID of the citation context/sentence in the article component.
    • total_sentences: The number of sentences in the article component.
    • intxt_id: The ID of the citation.
    • intxt_pmid: PMID of the citation (as tagged in the XML file). If a citation does not have a PMID tagged in the XML file, the value is "-".
    • intxt_pmid_source: The sources where the intxt_pmid can be identified. xml represents that the PMID is only identified from the XML file; xml,pmc represents that the PMID is not only from the XML file, but also in the citation data collected from the NCBI Entrez Programming Utilities. If a citation does not have an intxt_pmid, the value is "-".
    • intxt_mark: The citation marker associated with the inline citation.
    • best_id: The best source link ID (e.g., PMID) of the citation.
    • best_source: The sources that confirm the best ID.
    • best_id_diff: The comparison result between the best_id column and the intxt_pmid column.
    • citation: A citation context. If no citation is found in a sentence, the value is the sentence.
    • progression: Text progression of the citation context/sentence.

    Supplementary Files
    • PMC-OA-patci.tsv.gz – This file contains the best source link IDs for the references (e.g., PMID). Patci [1] was used to identify the best source link IDs. The best source link IDs are mapped to the citation contexts and displayed in the *_journal_IntxtCit.tsv files as the best_id column. Each row in the PMC-OA-patci.tsv.gz file is a citation (i.e., a reference extracted from the XML file) and contains the following columns:
      • pmcid: PMCID of the citing article.
      • pos: The citation's position in the reference list.
      • fromPMID: PMID of the citing article.
      • toPMID: Source link ID (e.g., PMID) of the citation. This ID is identified by Patci.
      • SRC: The sources that confirm the toPMID.
      • MatchDB: The origin bibliographic database of the toPMID.
      • Probability: The match probability of the toPMID.
      • toPMID2: PMID of the citation (as tagged in the XML file).
      • SRC2: The sources that confirm the toPMID2.
      • intxt_id: The ID of the citation.
      • journal: The first letter of the journal title. This maps to the *_journal_IntxtCit.tsv files.
      • same_ref_string: Whether the citation string appears in the reference list more than once.
      • DIFF: The comparison result between the toPMID column and the toPMID2 column.
      • bestID: The best source link ID (e.g., PMID) of the citation.
      • bestSRC: The sources that confirm the best ID.
      • Match: Matching result produced by Patci.
    • Supplementary_File_1.zip – This file contains the code for generating the dataset.

    [1] Agarwal, S., Lincoln, M., Cai, H., & Torvik, V. (2014). Patci – a tool for identifying scientific articles cited by patents. GSLIS Research Showcase 2014. http://hdl.handle.net/2142/54885
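    Illustration only: a chunked pandas read of one of the per-letter TSV files, counting citation contexts. Column names are quoted from the description above, but the presence of a header row and the treatment of an empty intxt_id as marking a non-citation sentence are assumptions.

    ```python
    import pandas as pd

    # Column names are quoted from the column description above. Each file is large,
    # so read it in chunks. Assumes the file includes a header row; treating a
    # non-empty "intxt_id" as marking a citation context is also an assumption.
    usecols = ["pmcid", "IMRaD", "intxt_id", "citation"]
    n_sentences = 0
    n_contexts = 0

    for chunk in pd.read_csv("A_journal_IntxtCit.tsv", sep="\t", usecols=usecols,
                             chunksize=500_000, dtype=str):
        n_sentences += len(chunk)
        n_contexts += chunk["intxt_id"].notna().sum()

    print(f"{n_contexts} citation contexts out of {n_sentences} sentences")
    ```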

  10. Dataset of Springer Nature Group related journals.

    • repository.cam.ac.uk
    bin, txt
    Updated Oct 31, 2022
    Cite
    Malin, Niamh (2022). Dataset of Springer Nature Group related journals. [Dataset]. http://doi.org/10.17863/CAM.90061
    Explore at:
    Available download formats: txt (4288 bytes), bin (1200224 bytes), bin (847930 bytes)
    Dataset updated
    Oct 31, 2022
    Dataset provided by
    Apollo
    University of Cambridge
    Authors
    Malin, Niamh
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset compiles the identifying details of journals published by the Springer Nature Group. This covers journals published by Springer Nature, Nature Portfolio, Palgrave Macmillan, Springer, BioMed Central, and Scientific American. It accounts for both current (as of Oct. 2022) and archived journals.

    It compiles identifiers of each journal where possible: Title, Alternative title, eBook and Print ISSN, Title ID, Active years, Primary language, Publisher and imprint, Access type, Default licence and Platform URL.

    In particular, to aid the Springer negotiations of 2022/23, it identifies the ISSN used via UnSub.org (and Jisc) for relevant journals and a URL link to the editorial board of current journals; a subject area has also been assigned to each journal.

    This dataset is designed to further aid data analysis by universities in preparation for the Springer negotiations of 2022/23 (following the success of the Elsevier negotiations in 2021/22).

    Further information and sources can be found in the attached ReadMe file. Both CSV and XLSX files contain the same data in two different formats.
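    Illustration only: a short pandas sketch that loads the CSV and counts journals per publisher and imprint; the file name is a placeholder, and the exact column header spelling may differ from the field list above.

    ```python
    import pandas as pd

    # File name is illustrative; the column header "Publisher and imprint" is taken
    # from the field list above, but the exact spelling in the file may differ.
    journals = pd.read_csv("springer_nature_journals.csv")

    if "Publisher and imprint" in journals.columns:
        print(journals["Publisher and imprint"].value_counts())
    else:
        print(journals.columns.tolist())  # inspect the real headers first
    ```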

  11. Data from: Knowledge organization in the dynamics of research in the...

    • scielo.figshare.com
    • figshare.com
    jpeg
    Updated May 30, 2023
    Cite
    Juliana Lazzarotto Freitas; Bruna Silva do Nascimento; Leilah Santiago Bufrem (2023). Knowledge organization in the dynamics of research in the scientific literature of articles at Brapci [Dataset]. http://doi.org/10.6084/m9.figshare.7515698.v1
    Explore at:
    Available download formats: jpeg
    Dataset updated
    May 30, 2023
    Dataset provided by
    SciELO journals
    Authors
    Juliana Lazzarotto Freitas; Bruna Silva do Nascimento; Leilah Santiago Bufrem
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The aim of this paper is to analyze the research dimensions expressed in the literature in the field of knowledge organization. We consider that scientific investigation developed in certain social contexts and historical conjunctures reflects the changes and contradictions present in these contexts, both in its organization and applications. This process is illustrated with a corpus of 105 articles retrieved from Brapci, the referential database of journal articles on Information Science, covering 2003 to 2012. EndNote and Excel were used for organizing and analyzing the data collected. The articles are categorized according to their object and methodological approach, as well as their theoretical foundations, through the study of citations. Throughout this investigation, we point out the most expressive and active authors in the field of knowledge organization in Brazil. We also identify the journals that are more devoted to the theme, and we conclude that Brazilian scientific production on the theme is irregularly distributed over the period analyzed, peaking in 2011 when 18.27% of the total number of articles were published. We conclude by establishing the relationship between theoretical currents of knowledge organization and the prevalent approaches and themes found in the corpus. The most prevalent research approaches were theoretical and linguistic, and the predominant type of analysis was documentary.

  12. Methodology data of "Twenty years of research in Digital Humanities: a topic...

    • live.european-language-grid.eu
    • data.niaid.nih.gov
    json
    Updated Sep 10, 2022
    Cite
    (2022). Methodology data of "Twenty years of research in Digital Humanities: a topic modeling study" [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/18310
    Explore at:
    Available download formats: json
    Dataset updated
    Sep 10, 2022
    License

    CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This document contains the datasets created in the thesis "Twenty years of research in Digital Humanities: a topic modeling study". The methodological approach of the work is based on two datasets built by web scraping DH journals' official web pages and by API requests to popular academic databases (Crossref, DataCite). The datasets constitute a corpus of DH research and include research paper abstracts and abstract papers from DH journals and international DH conferences published between 2000 and 2020. Probabilistic topic modeling with latent Dirichlet allocation is then performed on both datasets to identify relevant research subfields.

    Data

    Folder "data/" contains four folders which relate to two datasets:

    1. The first dataset, which will be referred to as the journals dataset, contains original research papers published in journals exclusively devoted to digital humanities scholarship [1] and is composed of 2,464 articles from 26 journals.
    2. The second dataset, the conference dataset, contains abstract papers available in ADHO conference archives and is composed of 2,160 articles from 15 years of ADHO conferences and 4 conferences promoted by journals.

    Both datasets are provided with: URL (if available); identifier and related scheme (if available); abstract or abstract paper; title; authors’ given name, family name; author’s affiliation name, found within the document metadata or text; normalized affiliation name, country of the affiliation, identifiers of the affiliation provided by the Research Organization Registry Community (ROR, https://ror.org); publisher (if available); publishing date (complete date when provided or only the year); keywords (if available); journal title; volume and issue (if available); electronic and/or print ISSN (if available).

    The two folders "data/no_abstracts..." are licensed under a Creative Commons public domain dedication (CC0), while the others keep their original license (the one provided by their publisher) because they contain full abstracts of the papers. These latter datasets are provided in order to favor the reproducibility of the results obtained in our work.

    Topic modeling

    "topic_modeling/" directory contains input and output data used within MITAO, a tool for mashing up automatic text analysis tools, and creating a completely customizable visual workflow [2]. The topic modeling results are divided in two folders, one for each of the datasets.

    Note: It's necessary to unzip the file to get access to all the files and directories listed below.

    References

    1. Spinaci, G., Colavizza, G., Peroni, S. (2020). Preliminary Results on Mapping Digital Humanities Research. In: Atti del IX Convegno Annuale AIUCD. La svolta inevitabile: sfide e prospettive per l'Informatica Umanistica, Università Cattolica del Sacro Cuore, Milan, Italy, 15-17 January 2020, pp. 246-252.
    2. Ferri, P., Heibi, I., Pareschi, L., & Peroni, S. (2020). MITAO: A User Friendly and Modular Software for Topic Modelling [JD]. PuntOorg International Journal, 5(2), 135–149. https://doi.org/10.19245/25.05.pij.5.2.3

  13. Abstracts for Topic Prediction Dataset

    • opendatabay.com
    Updated Jul 5, 2025
    Cite
    Datasimple (2025). Abstracts for Topic Prediction Dataset [Dataset]. https://www.opendatabay.com/data/ai-ml/83eaad55-f12c-4461-bcc3-d748fda9ded2
    Explore at:
    Available download formats
    Dataset updated
    Jul 5, 2025
    Dataset authored and provided by
    Datasimple
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Education & Learning Analytics
    Description

    This dataset provides a curated collection of research article abstracts and titles, designed to facilitate topic modelling, tagging, and advanced search and recommendation systems within large online archives of scientific literature. Researchers often face challenges in identifying relevant articles, and this dataset aims to streamline that process by enabling the prediction of article topics. Each research article within the dataset can be associated with one or more topics, reflecting the multidisciplinary nature of contemporary research. The abstracts are drawn from six distinct academic fields: Computer Science, Mathematics, Physics, Statistics, Quantitative Biology, and Quantitative Finance.

    Columns

    • Title: The title of the research article.
    • Abstract: The abstract of the research article, providing a summary of its content.
    • Topics: The assigned topic or topics for the research article. An article may have multiple topics.

    Distribution

    The dataset is typically provided in a CSV (Comma Separated Values) format. Specific numbers for rows or records are not currently available. It is structured to support the analysis of research article abstracts and titles for topic identification, with articles categorised across six primary scientific disciplines.

    Usage

    This dataset is ideal for developing and evaluating machine learning models for natural language processing (NLP), specifically for topic modelling and classification tasks. Key applications include:

    • Building systems that can automatically tag research articles with relevant keywords or subjects.
    • Developing recommendation engines that suggest pertinent articles to researchers based on their interests.
    • Enhancing search functionalities in digital libraries and academic databases.
    • Training models to predict the topics for new research articles, given their abstract and title.
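
    As a hedged illustration of the classification use case, the sketch below trains a simple multi-label model on the Title, Abstract and Topics columns described above; the file name and the assumption that Topics is a comma-separated string are hypothetical and would need to be adapted to the actual download.

    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.preprocessing import MultiLabelBinarizer

    # Hypothetical file name; Title, Abstract and Topics follow the column list above,
    # and Topics is assumed here to be a comma-separated string of labels.
    df = pd.read_csv("research_articles.csv")
    text = df["Title"].fillna("") + ". " + df["Abstract"].fillna("")
    mlb = MultiLabelBinarizer()
    y = mlb.fit_transform(df["Topics"].fillna("").str.split(","))

    X_train, X_test, y_train, y_test = train_test_split(text, y, test_size=0.2, random_state=0)
    vec = TfidfVectorizer(stop_words="english", max_features=50000)
    clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
    clf.fit(vec.fit_transform(X_train), y_train)

    # Exact-match (subset) accuracy over the six disciplines on the held-out split.
    print(clf.score(vec.transform(X_test), y_test))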

    Coverage

    The dataset's coverage is global, encompassing research articles without a specified temporal range for the articles themselves. The data pertains to abstracts from six academic topics: Computer Science, Mathematics, Physics, Statistics, Quantitative Biology, and Quantitative Finance. No specific demographic scope is applicable to the research articles themselves.

    License

    CC0

    Who Can Use It

    This dataset is primarily intended for researchers, data scientists, and machine learning engineers involved in:

    • Academic research: For studying and developing new methods in NLP and information retrieval.
    • Educational analytics: To understand and categorise scholarly output.
    • AI and Machine Learning development: For training and testing algorithms that process and classify textual data, particularly in the context of scientific literature.
    • Data product developers: To build features like smart search, content recommendation, or automated content organisation for academic platforms.

    Dataset Name Suggestions

    • Research Article Topics
    • Scientific Paper Abstracts
    • Academic Topic Classifier
    • Multi-Discipline Article Topics
    • Abstracts for Topic Prediction

    Attributes

    Original Data Source: Research Articles Dataset

  14. Open access practices of selected library science journals

    • data.niaid.nih.gov
    • datadryad.org
    zip
    Updated May 7, 2025
    Cite
    Jennifer Jordan; Blair Solon; Stephanie Beene (2025). Open access practices of selected library science journals [Dataset]. http://doi.org/10.5061/dryad.pvmcvdnt3
    Explore at:
    Available download formats: zip
    Dataset updated
    May 7, 2025
    Dataset provided by
    University of New Mexico
    Authors
    Jennifer Jordan; Blair Solon; Stephanie Beene
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    The data in this set was gathered to analyze the open access practices of library journals. The data was culled from the Directory of Open Access Journals (DOAJ), the Proquest database Library and Information Science Abstracts (LISA), and a sample of peer reviewed scholarly journals in the field of Library Science. Starting with a batch of 377 journals, the researchers focused their dataset to include journals that met the following criteria: 1) peer-reviewed, 2) written in English or abstracted in English, 3) actively published at the time of analysis, and 4) scoped to librarianship. The dataset presents an overview of the landscape of open access scholarly publishing in the LIS field during a very specific time period, spring and summer of 2023.

    Methods

    Data Collection

    The researchers gathered 377 scholarly journals whose content covered the work of librarians, archivists, and affiliated information professionals. This data encompassed 222 journals from the Proquest database Library and Information Science Abstracts (LISA), widely regarded as an authoritative database in the field of librarianship. From the Directory of Open Access Journals, we included 144 LIS journals. We also included 11 other journals not indexed in DOAJ or LISA, based on the researchers’ knowledge of existing OA library journals. The data is separated into several different sets representing the different indices and journals we searched. The first set includes journals from the database LISA. The following fields are in this dataset:

    Journal: title of the journal

    Publisher: title of the publishing company

    Publisher Type: the kind of publisher, whether association, traditional, university library, or independent

    Country of publication: country where the journal is published

    Region: geographical place of publication

    Open Data Policy: lists whether an open data policy exists and what the policy is

    Open Data Notes: descriptions of the open data policies

    Open ranking: details whether the journal is diamond, gold, and/or green

    Open peer review: specifies if the journal does open peer review

    Author retains copyright: explains copyright policy

    APCs: Details whether there is an article processing charge

    In DOAJ: details whether the journal is also published in the Directory of Open Access Journals

    The second set includes the same fields as the previous set, but it also includes two additional columns:

    Type of CC: lists the Creative Commons license applied to the journal articles

    In LISA: details whether the journal is also published in the Library and Information Science Abstracts database

    A third dataset includes eleven scholarly, peer reviewed journals focused on Library and Information Science that were not in DOAJ or LISA. This dataset is also labeled with the same fields as the first dataset. The fourth dataset is the complete list of 377 journals that we evaluated for inclusion in this dataset.

    Data Processing

    To explore the current state of OA scholarly publishing in librarianship, we developed the following criteria: journals must be published at the time of analysis, peer reviewed, and scoped to librarianship, and must have articles or abstracts in English so that we could determine the journal’s scope. After applying inclusion/exclusion criteria, 145 of 377 journals remained; however, the total number of journals analyzed is 133 because DOAJ and LISA shared 12 journals. The researchers explored the open data policies, open access publication options, country of origin, publisher, and peer review process of each of the remaining 133 journals. The researchers also looked for article processing costs, type of Creative Commons licensing (open licenses that allow users to redistribute and sometimes remix intellectual property), and whether the journals were included in either the DOAJ and/or LISA index.

    References: Budapest Open Access Initiative. (2002) http://www.soros.org/openaccess/
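
    As a small, hedged sketch of how the resulting tables could be combined for analysis, the snippet below assumes the sets described above are exported as separate CSV files; the file names are hypothetical, while the column names follow the field list given above.

    import pandas as pd

    # Hypothetical file names for the LISA-derived and DOAJ-derived sets.
    lisa = pd.read_csv("lisa_journals.csv")
    doaj = pd.read_csv("doaj_journals.csv")

    # Overlap between the two indexes (the authors report 12 shared titles).
    shared = set(lisa["Journal"].str.strip()) & set(doaj["Journal"].str.strip())
    print(len(shared), "journals indexed in both LISA and DOAJ")

    # Simple summaries of the open-access model and article processing charges.
    print(lisa["Open ranking"].value_counts(dropna=False))
    print(lisa["APCs"].value_counts(dropna=False))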

  15. Dataset for the paper: "Monant Medical Misinformation Dataset: Mapping...

    • zenodo.org
    • data.niaid.nih.gov
    Updated Apr 22, 2022
    + more versions
    Cite
    Ivan Srba; Branislav Pecher; Matus Tomlein; Robert Moro; Elena Stefancova; Jakub Simko; Maria Bielikova (2022). Dataset for the paper: "Monant Medical Misinformation Dataset: Mapping Articles to Fact-Checked Claims" [Dataset]. http://doi.org/10.5281/zenodo.5996864
    Explore at:
    Dataset updated
    Apr 22, 2022
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Ivan Srba; Branislav Pecher; Matus Tomlein; Robert Moro; Elena Stefancova; Jakub Simko; Maria Bielikova
    Description

    Overview

    This dataset of medical misinformation was collected and is published by Kempelen Institute of Intelligent Technologies (KInIT). It consists of approx. 317k news articles and blog posts on medical topics published between January 1, 1998 and February 1, 2022 from a total of 207 reliable and unreliable sources. The dataset contains full-texts of the articles, their original source URL and other extracted metadata. If a source has a credibility score available (e.g., from Media Bias/Fact Check), it is also included in the form of annotation. Besides the articles, the dataset contains around 3.5k fact-checks and extracted verified medical claims with their unified veracity ratings published by fact-checking organisations such as Snopes or FullFact. Lastly and most importantly, the dataset contains 573 manually and more than 51k automatically labelled mappings between previously verified claims and the articles; mappings consist of two values: claim presence (i.e., whether a claim is contained in the given article) and article stance (i.e., whether the given article supports or rejects the claim or provides both sides of the argument).

    The dataset is primarily intended to be used as a training and evaluation set for machine learning methods for claim presence detection and article stance classification, but it enables a range of other misinformation related tasks, such as misinformation characterisation or analyses of misinformation spreading.

    Its novelty and our main contributions lie in (1) the focus on medical news articles and blog posts as opposed to social media posts or political discussions; (2) providing multiple modalities (besides full-texts of the articles, there are also images and videos), thus enabling research of multimodal approaches; (3) mapping of the articles to the fact-checked claims (with manual as well as predicted labels); (4) providing source credibility labels for 95% of all articles and other potential sources of weak labels that can be mined from the articles' content and metadata.

    The dataset is associated with the research paper "Monant Medical Misinformation Dataset: Mapping Articles to Fact-Checked Claims" accepted and presented at ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '22).

    The accompanying Github repository provides a small static sample of the dataset and the dataset's descriptive analysis in the form of Jupyter notebooks.

    Options to access the dataset

    There are two ways to get access to the dataset:

    1. Static dump of the dataset available in the CSV format
    2. Continuously updated dataset available via REST API

    To obtain access to the dataset (either the full static dump or the REST API), please request access by following the instructions provided below.

    References

    If you use this dataset in any publication, project, tool or in any other form, please cite the following papers:

    @inproceedings{SrbaMonantPlatform,
      author = {Srba, Ivan and Moro, Robert and Simko, Jakub and Sevcech, Jakub and Chuda, Daniela and Navrat, Pavol and Bielikova, Maria},
      booktitle = {Proceedings of Workshop on Reducing Online Misinformation Exposure (ROME 2019)},
      pages = {1--7},
      title = {Monant: Universal and Extensible Platform for Monitoring, Detection and Mitigation of Antisocial Behavior},
      year = {2019}
    }
    @inproceedings{SrbaMonantMedicalDataset,
      author = {Srba, Ivan and Pecher, Branislav and Tomlein, Matus and Moro, Robert and Stefancova, Elena and Simko, Jakub and Bielikova, Maria},
      booktitle = {Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '22)},
      numpages = {11},
      title = {Monant Medical Misinformation Dataset: Mapping Articles to Fact-Checked Claims},
      year = {2022},
      doi = {10.1145/3477495.3531726},
      publisher = {Association for Computing Machinery},
      address = {New York, NY, USA},
      url = {https://doi.org/10.1145/3477495.3531726},
    }
    


    Dataset creation process

    In order to create this dataset (and to continuously obtain new data), we used our research platform Monant. The Monant platform provides so-called data providers to extract news articles/blogs from news/blog sites as well as fact-checking articles from fact-checking sites. General parsers (for RSS feeds, Wordpress sites, Google Fact Check Tool, etc.) as well as custom crawlers and parsers were implemented (e.g., for the fact-checking site Snopes.com). All data are stored in a unified format in a central data storage.


    Ethical considerations

    The dataset was collected and is published for research purposes only. We collected only publicly available content of news/blog articles. The dataset contains identities of authors of the articles if they were stated in the original source; we left this information, since the presence of an author's name can be a strong credibility indicator. However, we anonymised the identities of the authors of discussion posts included in the dataset.

    The main identified ethical issue related to the presented dataset lies in the risk of mislabelling of an article as supporting a false fact-checked claim and, to a lesser extent, in mislabelling an article as not containing a false claim or not supporting it when it actually does. To minimise these risks, we developed a labelling methodology and require an agreement of at least two independent annotators to assign a claim presence or article stance label to an article. It is also worth noting that we do not label an article as a whole as false or true. Nevertheless, we provide partial article-claim pair veracities based on the combination of claim presence and article stance labels.

    As to the veracity labels of the fact-checked claims and the credibility (reliability) labels of the articles' sources, we take these from the fact-checking sites and external listings such as Media Bias/Fact Check as they are and refer to their methodologies for more details on how they were established.

    Lastly, the dataset also contains automatically predicted labels of claim presence and article stance using our baselines described in the next section. These methods have their limitations and work with certain accuracy as reported in this paper. This should be taken into account when interpreting them.


    Reporting mistakes in the dataset

    The way to report considerable mistakes in the raw collected data or in the manual annotations is to create a new issue in the accompanying Github repository. Alternatively, general enquiries or requests can be sent to info [at] kinit.sk.


    Dataset structure

    Raw data

    First, the dataset contains so-called raw data (i.e., data extracted by the Web monitoring module of the Monant platform and stored in exactly the same form as they appear at the original websites). Raw data consist of articles from news sites and blogs (e.g., naturalnews.com), discussions attached to such articles, and fact-checking articles from fact-checking portals (e.g., snopes.com). In addition, the dataset contains feedback (numbers of likes, shares, and comments) provided by users on the social network Facebook, which is regularly extracted for all news/blog articles.

    Raw data are contained in these CSV files (and corresponding REST API endpoints):

    • sources.csv
    • articles.csv
    • article_media.csv
    • article_authors.csv
    • discussion_posts.csv
    • discussion_post_authors.csv
    • fact_checking_articles.csv
    • fact_checking_article_media.csv
    • claims.csv
    • feedback_facebook.csv

    Note: Personal information about discussion posts' authors (name, website, gravatar) is anonymised.
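
    Assuming access to the static CSV dump has been granted (see the access instructions above), a quick inventory of the raw files can be produced along the following lines; this is only an illustrative sketch and makes no assumption about the columns inside each file.

    import pandas as pd

    raw_files = [
        "sources.csv", "articles.csv", "article_media.csv", "article_authors.csv",
        "discussion_posts.csv", "discussion_post_authors.csv",
        "fact_checking_articles.csv", "fact_checking_article_media.csv",
        "claims.csv", "feedback_facebook.csv",
    ]

    # Print row counts and column names for each file in the raw-data dump.
    for name in raw_files:
        frame = pd.read_csv(name, low_memory=False)
        print(f"{name}: {len(frame)} rows; columns: {list(frame.columns)}")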


    Annotations

    Second, the dataset contains so-called annotations. Entity annotations describe individual raw-data entities (e.g., article, source). Relation annotations describe a relation between two such entities.

    Each annotation is described by the following attributes:

    1. category of annotation (`annotation_category`). Possible values: label (the annotation corresponds to ground truth determined by human experts) and prediction (the annotation was created by means of an AI method).
    2. type of annotation (`annotation_type_id`). Example values: Source reliability (binary), Claim presence. The list of possible values can be obtained from the enumeration in annotation_types.csv.
    3. method which created the annotation (`method_id`). Example values: Expert-based source reliability evaluation, Fact-checking article to claim transformation method. The list of possible values can be obtained from the enumeration in methods.csv.
    4. its value (`value`). The value is stored in JSON format and its structure differs according to the particular annotation type.


    At the same time, annotations are associated with a particular object identified by:

    1. entity type (parameter entity_type in case of entity annotations, or source_entity_type and target_entity_type in case of relation annotations). Possible values: sources, articles, fact-checking-articles.
    2. entity id (parameter entity_id in case of entity annotations, or source_entity_id and target_entity_id in case of relation

  16. Document level disaggregated data about evaluation and publication delay in...

    • datos.cchs.csic.es
    csv, txt
    Updated Apr 16, 2025
    + more versions
    Cite
    Agencia Estatal Consejo Superior de Investigaciones Científicas (CSIC) (2025). Document level disaggregated data about evaluation and publication delay in Ibero-American scientific journals (2018-2020) - Datos abiertos CCHS [Dataset]. http://doi.org/10.20350/digitalCSIC/14628
    Explore at:
    Available download formats: csv, txt
    Dataset updated
    Apr 16, 2025
    Dataset provided by
    Spanish National Research Council: http://www.csic.es/
    Authors
    Agencia Estatal Consejo Superior de Investigaciones Científicas (CSIC)
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Document level disaggregated data of the review, acceptance and publication dates of a sample of 21,890 articles from 326 Ibero-American scientific journals from all subject areas and countries included in the Latindex Catalogue 2.0 and published between 2018 and 2020. The variables included are: document identifier; identifier of the journal section in which the document was included; literal of the journal section in which the document was included; source data; year of publication of the article; reception date of the article; acceptance date of the article; publication date of the article; days between reception date and acceptance date; days between acceptance date and publication date; days between reception date and publication date; identifier of the country/region of the journal; literal of the country/region of the journal; subject area identifier; subject area literal; journal periodicity identifier; journal periodicity; journal identifier; ISSN; journal title. Description: Document level disaggregated dataset about evaluation and publication delay in Ibero-American scientific journals (2018-2020). This dataset has a Creative Commons BY-NC-SA 4.0 licence.
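
    The three delay indicators can be re-derived from the three dates; the sketch below shows the idea with pandas, using hypothetical file and column names (the dataset's own variable names are documented in its accompanying description).

    import pandas as pd

    # Hypothetical file and column names, used only to illustrate the computation.
    df = pd.read_csv("iberoamerican_publication_delays.csv")
    for col in ("reception_date", "acceptance_date", "publication_date"):
        df[col] = pd.to_datetime(df[col], errors="coerce")

    # Re-derive the three delay indicators described above.
    df["evaluation_days"] = (df["acceptance_date"] - df["reception_date"]).dt.days
    df["production_days"] = (df["publication_date"] - df["acceptance_date"]).dt.days
    df["total_delay_days"] = (df["publication_date"] - df["reception_date"]).dt.days

    # Median total delay per subject area (the grouping column is also an assumption).
    print(df.groupby("subject_area")["total_delay_days"].median())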

  17. Data from: Inventory of online public databases and repositories holding...

    • s.cnmilf.com
    • datadiscoverystudio.org
    • +4more
    Updated Apr 21, 2025
    + more versions
    Cite
    Agricultural Research Service (2025). Inventory of online public databases and repositories holding agricultural data in 2017 [Dataset]. https://s.cnmilf.com/user74170196/https/catalog.data.gov/dataset/inventory-of-online-public-databases-and-repositories-holding-agricultural-data-in-2017-d4c81
    Explore at:
    Dataset updated
    Apr 21, 2025
    Dataset provided by
    Agricultural Research Service: https://www.ars.usda.gov/
    Description

    United States agricultural researchers have many options for making their data available online. This dataset aggregates the primary sources of ag-related data and determines where researchers are likely to deposit their agricultural data. These data serve as both a current landscape analysis and also as a baseline for future studies of ag research data.

    Purpose

    As sources of agricultural data become more numerous and disparate, and collaboration and open data become more expected if not required, this research provides a landscape inventory of online sources of open agricultural data. An inventory of current agricultural data sharing options will help assess how the Ag Data Commons, a platform for USDA-funded data cataloging and publication, can best support data-intensive and multi-disciplinary research. It will also help agricultural librarians assist their researchers in data management and publication. The goals of this study were to:

    • establish where agricultural researchers in the United States -- land grant and USDA researchers, primarily ARS, NRCS, USFS and other agencies -- currently publish their data, including general research data repositories, domain-specific databases, and the top journals
    • compare how much data is in institutional vs. domain-specific vs. federal platforms
    • determine which repositories are recommended by top journals that require or recommend the publication of supporting data
    • ascertain where researchers not affiliated with funding or initiatives possessing a designated open data repository can publish data

    Approach

    The National Agricultural Library team focused on Agricultural Research Service (ARS), Natural Resources Conservation Service (NRCS), and United States Forest Service (USFS) style research data, rather than ag economics, statistics, and social sciences data. To find domain-specific, general, institutional, and federal agency repositories and databases that are open to US research submissions and have some amount of ag data, resources including re3data, libguides, and ARS lists were analysed. Primarily environmental or public health databases were not included, but places where ag grantees would publish data were considered.

    Search methods

    We first compiled a list of known domain-specific USDA / ARS datasets / databases that are represented in the Ag Data Commons, including ARS Image Gallery, ARS Nutrition Databases (sub-components), SoyBase, PeanutBase, National Fungus Collection, i5K Workspace @ NAL, and GRIN. We then searched using search engines such as Bing and Google for non-USDA / federal ag databases, using Boolean variations of “agricultural data” / “ag data” / “scientific data” + NOT + USDA (to filter out the federal / USDA results). Most of these results were domain specific, though some contained a mix of data subjects. We then used search engines such as Bing and Google to find top agricultural university repositories using variations of “agriculture”, “ag data” and “university” to find schools with agriculture programs. Using that list of universities, we searched each university web site to see if their institution had a repository for their unique, independent research data if not apparent in the initial web browser search. We found both ag-specific university repositories and general university repositories that housed a portion of agricultural data. Ag-specific university repositories are included in the list of domain-specific repositories. Results included Columbia University – International Research Institute for Climate and Society, UC Davis – Cover Crops Database, etc. If a general university repository existed, we determined whether that repository could filter to include only data results after our chosen ag search terms were applied. General university databases that contain ag data included Colorado State University Digital Collections, University of Michigan ICPSR (Inter-university Consortium for Political and Social Research), and University of Minnesota DRUM (Digital Repository of the University of Minnesota). We then split out NCBI (National Center for Biotechnology Information) repositories. Next we searched the internet for open general data repositories using a variety of search engines, and repositories containing a mix of data, journals, books, and other types of records were tested to determine whether that repository could filter for data results after search terms were applied. General subject data repositories include Figshare, Open Science Framework, PANGEA, Protein Data Bank, and Zenodo. Finally, we compared scholarly journal suggestions for data repositories against our list to fill in any missing repositories that might contain agricultural data. Extensive lists of journals in which USDA published in 2012 and 2016 were compiled, combining search results in ARIS, Scopus, and the Forest Service's TreeSearch, plus the USDA web sites Economic Research Service (ERS), National Agricultural Statistics Service (NASS), Natural Resources Conservation Service (NRCS), Food and Nutrition Service (FNS), Rural Development (RD), and Agricultural Marketing Service (AMS). The top 50 journals' author instructions were consulted to see if they (a) ask or require submitters to provide supplemental data, or (b) require submitters to submit data to open repositories. Data are provided for journals based on a 2012 and 2016 study of where USDA employees publish their research studies, ranked by number of articles, including 2015/2016 Impact Factor, Author guidelines, Supplemental Data?, Supplemental Data reviewed?, Open Data (Supplemental or in Repository) Required?, and Recommended data repositories, as provided in the online author guidelines for each of the top 50 journals.

    Evaluation

    We ran a series of searches on all resulting general subject databases with the designated search terms. From the results, we noted the total number of datasets in the repository, type of resource searched (datasets, data, images, components, etc.), percentage of the total database that each term comprised, any dataset with a search term that comprised at least 1% and 5% of the total collection, and any search term that returned greater than 100 and greater than 500 results. We compared domain-specific databases and repositories based on parent organization, type of institution, and whether data submissions were dependent on conditions such as funding or affiliation of some kind.

    Results

    A summary of the major findings from our data review:

    • Over half of the top 50 ag-related journals from our profile require or encourage open data for their published authors.
    • There are few general repositories that are both large AND contain a significant portion of ag data in their collection. GBIF (Global Biodiversity Information Facility), ICPSR, and ORNL DAAC were among those that had over 500 datasets returned with at least one ag search term and had that result comprise at least 5% of the total collection.
    • Not even one quarter of the domain-specific repositories and datasets reviewed allow open submission by any researcher regardless of funding or affiliation.

    See the included README file for descriptions of each individual data file in this dataset.

    Resources in this dataset:

    • Resource Title: Journals. File Name: Journals.csv
    • Resource Title: Journals - Recommended repositories. File Name: Repos_from_journals.csv
    • Resource Title: TDWG presentation. File Name: TDWG_Presentation.pptx
    • Resource Title: Domain Specific ag data sources. File Name: domain_specific_ag_databases.csv
    • Resource Title: Data Dictionary for Ag Data Repository Inventory. File Name: Ag_Data_Repo_DD.csv
    • Resource Title: General repositories containing ag data. File Name: general_repos_1.csv
    • Resource Title: README and file inventory. File Name: README_InventoryPublicDBandREepAgData.txt

  18. Data for: Integrating open education practices with data analysis of open...

    • search.dataone.org
    • data.niaid.nih.gov
    Updated Jul 27, 2024
    + more versions
    Cite
    Marja Bakermans (2024). Data for: Integrating open education practices with data analysis of open science in an undergraduate course [Dataset]. http://doi.org/10.5061/dryad.37pvmcvst
    Explore at:
    Dataset updated
    Jul 27, 2024
    Dataset provided by
    Dryad Digital Repository
    Authors
    Marja Bakermans
    Description

    The open science movement produces vast quantities of openly published data connected to journal articles, creating an enormous resource for educators to engage students in current topics and analyses. However, educators face challenges using these materials to meet course objectives. I present a case study using open science (published articles and their corresponding datasets) and open educational practices in a capstone course. While engaging in current topics of conservation, students trace connections in the research process, learn statistical analyses, and recreate analyses using the programming language R. I assessed the presence of best practices in open articles and datasets, examined student selection in the open grading policy, surveyed students on their perceived learning gains, and conducted a thematic analysis on student reflections. First, articles and datasets met just over half of the assessed fairness practices, but this increased with the publication date. There was a...

    Article and dataset fairness

    To assess the utility of open articles and their datasets as an educational tool in an undergraduate academic setting, I measured the congruence of each pair to a set of best practices and guiding principles. I assessed ten guiding principles and best practices (Table 1), where each category was scored ‘1’ or ‘0’ based on whether it met that criteria, with a total possible score of ten.

    Open grading policies

    Students were allowed to specify the percentage weight for each assessment category in the course, including 1) six coding exercises (Exercises), 2) one lead exercise (Lead Exercise), 3) fourteen annotation assignments of readings (Annotations), 4) one final project (Final Project), 5) five discussion board posts and a statement of learning reflection (Discussion), and 6) attendance and participation (Participation). I examined if assessment categories (independent variable) were weighted (dependent variable) differently by students using an analysis of ...

    Data for: Integrating open education practices with data analysis of open science in an undergraduate course

    Author: Marja H Bakermans
    Affiliation: Worcester Polytechnic Institute, 100 Institute Rd, Worcester, MA 01609 USA
    ORCID: https://orcid.org/0000-0002-4879-7771
    Institutional IRB approval: IRB-24–0314

    Data and file overview

    The full dataset file called OEPandOSdata (.xlsx extension) contains 8 files. Below are descriptions of the name and contents of each file. NA = not applicable or no data available

    1. BestPracticesData.csv
      • Description: Data to assess the adherence of articles and datasets to open science best practices.
      • Column headers and descriptions:
        • Article: articles used in the study, numbered randomly
        • F1: Findable, Data are assigned a unique and persistent doi
        • F2: Findable, Metadata includes an identifier of data
        • F3: Findable, Data are registered in a searchable database
        • A1: ...
  19. Conceptualization of public data ecosystems

    • data.niaid.nih.gov
    Updated Sep 26, 2024
    Cite
    Martin, Lnenicka (2024). Conceptualization of public data ecosystems [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_13842001
    Explore at:
    Dataset updated
    Sep 26, 2024
    Dataset provided by
    Anastasija, Nikiforova
    Martin, Lnenicka
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains data collected during a study "Understanding the development of public data ecosystems: from a conceptual model to a six-generation model of the evolution of public data ecosystems" conducted by Martin Lnenicka (University of Hradec Králové, Czech Republic), Anastasija Nikiforova (University of Tartu, Estonia), Mariusz Luterek (University of Warsaw, Warsaw, Poland), Petar Milic (University of Pristina - Kosovska Mitrovica, Serbia), Daniel Rudmark (Swedish National Road and Transport Research Institute, Sweden), Sebastian Neumaier (St. Pölten University of Applied Sciences, Austria), Karlo Kević (University of Zagreb, Croatia), Anneke Zuiderwijk (Delft University of Technology, Delft, the Netherlands), Manuel Pedro Rodríguez Bolívar (University of Granada, Granada, Spain).

    As there is a lack of understanding of the elements that constitute different types of value-adding public data ecosystems and how these elements form and shape the development of these ecosystems over time, which can lead to misguided efforts to develop future public data ecosystems, the aim of the study is: (1) to explore how public data ecosystems have developed over time and (2) to identify the value-adding elements and formative characteristics of public data ecosystems. Using an exploratory retrospective analysis and a deductive approach, we systematically review 148 studies published between 1994 and 2023. Based on the results, this study presents a typology of public data ecosystems and develops a conceptual model of elements and formative characteristics that contribute most to value-adding public data ecosystems, and develops a conceptual model of the evolutionary generation of public data ecosystems represented by six generations called Evolutionary Model of Public Data Ecosystems (EMPDE). Finally, three avenues for a future research agenda are proposed.

    This dataset is being made public both to act as supplementary data for "Understanding the development of public data ecosystems: from a conceptual model to a six-generation model of the evolution of public data ecosystems", Telematics and Informatics, and to document the Systematic Literature Review component that informs the study.

    Description of the data in this data set

    PublicDataEcosystem_SLR provides the structure of the protocol.

    Spreadsheet #1 provides the list of results after the search over three indexing databases and filtering out irrelevant studies.

    Spreadsheet #2 provides the protocol structure.

    Spreadsheet #3 provides the filled protocol for relevant studies.

    The information on each selected study was collected in four categories: (1) descriptive information, (2) approach- and research design-related information, (3) quality-related information, (4) HVD determination-related information.

    Descriptive Information

    Article number

    A study number, corresponding to the study number assigned in an Excel worksheet

    Complete reference

    The complete source information to refer to the study (in APA style), including the author(s) of the study, the year in which it was published, the study's title and other source information.

    Year of publication

    The year in which the study was published.

    Journal article / conference paper / book chapter

    The type of the paper, i.e., journal article, conference paper, or book chapter.

    Journal / conference / book

    The journal, conference, or book in which the paper is published.

    DOI / Website

    A link to the website where the study can be found.

    Number of words

    The number of words in the study.

    Number of citations in Scopus and WoS

    The number of citations of the paper in Scopus and WoS digital libraries.

    Availability in Open Access

    Availability of a study in the Open Access or Free / Full Access.

    Keywords

    Keywords of the paper as indicated by the authors (in the paper).

    Relevance for our study (high / medium / low)

    What is the relevance level of the paper for our study?

    Approach- and research design-related information

    Approach- and research design-related information

    Objective / Aim / Goal / Purpose & Research Questions

    The research objective and established RQs.

    Research method (including unit of analysis)

    The methods used to collect data in the study, including the unit of analysis that refers to the country, organisation, or other specific unit that has been analysed such as the number of use-cases or policy documents, number and scope of the SLR etc.

    Study’s contributions

    The study’s contribution as defined by the authors

    Qualitative / quantitative / mixed method

    Whether the study uses a qualitative, quantitative, or mixed methods approach?

    Availability of the underlying research data

    Whether the paper has a reference to the public availability of the underlying research data e.g., transcriptions of interviews, collected data etc., or explains why these data are not openly shared?

    Period under investigation

    Period (or moment) in which the study was conducted (e.g., January 2021-March 2022)

    Use of theory / theoretical concepts / approaches? If yes, specify them

    Does the study mention any theory / theoretical concepts / approaches? If yes, what theory / concepts / approaches? If any theory is mentioned, how is theory used in the study? (e.g., mentioned to explain a certain phenomenon, used as a framework for analysis, tested theory, theory mentioned in the future research section).

    Quality-related information

    Quality concerns

    Whether there are any quality concerns (e.g., limited information about the research methods used)?

    Public Data Ecosystem-related information

    Public data ecosystem definition

    How is the public data ecosystem defined in the paper, including any equivalent term used (most often infrastructure)? If an alternative term is used, what is the public data ecosystem called in the paper?

    Public data ecosystem evolution / development

    Does the paper define the evolution of the public data ecosystem? If yes, how is it defined and what factors affect it?

    What constitutes a public data ecosystem?

    What constitutes a public data ecosystem (components & relationships) - their "FORM / OUTPUT" presented in the paper (general description with more detailed answers to further additional questions).

    Components and relationships

    What components does the public data ecosystem consist of and what are the relationships between these components? Alternative names for components - element, construct, concept, item, helix, dimension etc. (detailed description).

    Stakeholders

    What stakeholders (e.g., governments, citizens, businesses, Non-Governmental Organisations (NGOs) etc.) does the public data ecosystem involve?

    Actors and their roles

    What actors does the public data ecosystem involve? What are their roles?

    Data (data types, data dynamism, data categories etc.)

    What data does the public data ecosystem cover (or is intended / designed for)? Refer to all data-related aspects, including but not limited to data types, data dynamism (static data, dynamic, real-time data, stream), prevailing data categories / domains / topics etc.

    Processes / activities / dimensions, data lifecycle phases

    What processes, activities, dimensions and data lifecycle phases (e.g., locate, acquire, download, reuse, transform, etc.) does the public data ecosystem involve or refer to?

    Level (if relevant)

    What is the level of the public data ecosystem covered in the paper? (e.g., city, municipal, regional, national (=country), supranational, international).

    Other elements or relationships (if any)

    What other elements or relationships does the public data ecosystem consist of?

    Additional comments

    Additional comments (e.g., what other topics affected the public data ecosystems and their elements, what is expected to affect the public data ecosystems in the future, what were important topics by which the period was characterised etc.).

    New papers

    Does the study refer to any other potentially relevant papers?

    Additional references to potentially relevant papers that were found in the analysed paper (snowballing).

    Format of the files: .xls, .csv (for the first spreadsheet only), .docx

    Licenses or restrictions: CC-BY

    For more info, see README.txt

  20. Literature review - automotive security

    • narcis.nl
    • data.mendeley.com
    Updated May 17, 2021
    Cite
    Pethő, Z (via Mendeley Data) (2021). Literature review - automotive security [Dataset]. http://doi.org/10.17632/z4744w5ptv.1
    Explore at:
    Dataset updated
    May 17, 2021
    Dataset provided by
    Data Archiving and Networked Services (DANS)
    Authors
    Pethő, Z (via Mendeley Data)
    Description

    A database of 140 scientific articles (journal and conference papers) from the automotive security domain. In the database, we assigned specific attributes to every article (such as the Web of Science Impact Factor or the number of citations). The data set was analyzed with K-means clustering and decision tree analysis to identify and characterize the resulting groups of papers.
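
    As a rough, hedged sketch of the kind of clustering applied, the snippet below groups papers by numeric attributes with K-means; the file and column names are assumptions for illustration, not the authors' actual variable names.

    import pandas as pd
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler

    # Hypothetical file and column names for the per-article attributes.
    df = pd.read_csv("automotive_security_articles.csv")
    features = df[["impact_factor", "citation_count", "publication_year"]].fillna(0)

    # Standardise the attributes, then cluster the papers into candidate topic groups.
    X = StandardScaler().fit_transform(features)
    df["cluster"] = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
    print(df.groupby("cluster")[["impact_factor", "citation_count"]].mean())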

    We did not aim to identify perfectly complementary categories but to define the relevant research topics of the automotive security domain. Consequently, some of the chosen categories may overlap with other topics, which means that these research categories may partly rest on common scientific and professional foundations. However, all the considered categories can be regarded as separate, scientifically significant, and highly relevant research orientations.

Cite
Carlota Balsa-Sanchez; Vanesa Loureiro (2025). Data articles in journals [Dataset]. http://doi.org/10.5281/zenodo.15553313

Data articles in journals

Explore at:
6 scholarly articles cite this dataset
Available download formats: txt, csv, xls
Dataset updated
May 30, 2025
Dataset provided by
Zenodo: http://zenodo.org/
Authors
Carlota Balsa-Sanchez; Vanesa Loureiro
License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Time period covered
2025
Description

Version: 6

Date of data collection: May 2025

General description: Publication of datasets according to the FAIR principles could be reached publishing a data paper (and/or a software paper) in data journals as well as in academic standard journals. The excel and CSV file contains a list of academic journals that publish data papers and software papers.

File list:

- data_articles_journal_list_v6.xlsx: full list of 177 academic journals in which data papers or/and software papers could be published
- data_articles_journal_list_v6.csv: full list of 177 academic journals in which data papers or/and software papers could be published
- readme_v6.txt, with a detailed description of the dataset and its variables.

Relationship between files: both files have the same information. Two different formats are offered to improve reuse

Type of version of the dataset: final processed version

Versions of the files: 6th version
- Information updated: number of journals (17 were added and 4 were deleted), URL, document types associated to a specific journal.
- Information added: diamond journals were identified.

Version: 5

Authors: Carlota Balsa-Sánchez, Vanesa Loureiro

Date of data collection: 2023/09/05

General description: The publication of datasets according to the FAIR principles, could be reached publishing a data paper (or software paper) in data journals or in academic standard journals. The excel and CSV file contains a list of academic journals that publish data papers and software papers.
File list:

- data_articles_journal_list_v5.xlsx: full list of 162 academic journals in which data papers or/and software papers could be published
- data_articles_journal_list_v5.csv: full list of 162 academic journals in which data papers or/and software papers could be published

Relationship between files: both files have the same information. Two different formats are offered to improve reuse

Type of version of the dataset: final processed version

Versions of the files: 5th version
- Information updated: number of journals, URL, document types associated to a specific journal.
163 journals (excel and csv)

Version: 4

Authors: Carlota Balsa-Sánchez, Vanesa Loureiro

Date of data collection: 2022/12/15

General description: The publication of datasets according to the FAIR principles, could be reached publishing a data paper (or software paper) in data journals or in academic standard journals. The excel and CSV file contains a list of academic journals that publish data papers and software papers.
File list:

- data_articles_journal_list_v4.xlsx: full list of 140 academic journals in which data papers or/and software papers could be published
- data_articles_journal_list_v4.csv: full list of 140 academic journals in which data papers or/and software papers could be published

Relationship between files: both files have the same information. Two different formats are offered to improve reuse

Type of version of the dataset: final processed version

Versions of the files: 4th version
- Information updated: number of journals, URL, document types associated to a specific journal, publishers normalization and simplification of document types
- Information added: listed in the Directory of Open Access Journals (DOAJ), indexed in Web of Science (WOS) and quartile in Journal Citation Reports (JCR) and/or Scimago Journal and Country Rank (SJR), Scopus and Web of Science (WOS), Journal Master List.

Version: 3

Authors: Carlota Balsa-Sánchez, Vanesa Loureiro

Date of data collection: 2022/10/28

General description: The publication of datasets according to the FAIR principles, could be reached publishing a data paper (or software paper) in data journals or in academic standard journals. The excel and CSV file contains a list of academic journals that publish data papers and software papers.
File list:

- data_articles_journal_list_v3.xlsx: full list of 124 academic journals in which data papers or/and software papers could be published
- data_articles_journal_list_3.csv: full list of 124 academic journals in which data papers or/and software papers could be published

Relationship between files: both files have the same information. Two different formats are offered to improve reuse

Type of version of the dataset: final processed version

Versions of the files: 3rd version
- Information updated: number of journals, URL, document types associated to a specific journal, publishers normalization and simplification of document types
- Information added: listed in the Directory of Open Access Journals (DOAJ), indexed in Web of Science (WOS) and quartile in Journal Citation Reports (JCR) and/or Scimago Journal and Country Rank (SJR).

Erratum - Data articles in journals Version 3:

Botanical Studies -- ISSN 1999-3110 -- JCR (JIF) Q2
Data -- ISSN 2306-5729 -- JCR (JIF) n/a
Data in Brief -- ISSN 2352-3409 -- JCR (JIF) n/a

Version: 2

Author: Francisco Rubio, Universitat Politècnica de València.

Date of data collection: 2020/06/23

General description: The publication of datasets according to the FAIR principles, could be reached publishing a data paper (or software paper) in data journals or in academic standard journals. The excel and CSV file contains a list of academic journals that publish data papers and software papers.
File list:

- data_articles_journal_list_v2.xlsx: full list of 56 academic journals in which data papers or/and software papers could be published
- data_articles_journal_list_v2.csv: full list of 56 academic journals in which data papers or/and software papers could be published

Relationship between files: both files have the same information. Two different formats are offered to improve reuse

Type of version of the dataset: final processed version

Versions of the files: 2nd version
- Information updated: number of journals, URL, document types associated to a specific journal, publishers normalization and simplification of document types
- Information added: listed in the Directory of Open Access Journals (DOAJ), indexed in Web of Science (WOS) and quartile in Scimago Journal and Country Rank (SJR)

Total size: 32 KB

Version 1: Description

This dataset contains a list of journals that publish data articles, code, software articles and database articles.

The search strategy in DOAJ and Ulrichsweb was to search for the word "data" in journal titles.
Acknowledgements:
Xaquín Lores Torres for his invaluable help in preparing this dataset.
