Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset accompanying the paper "Scientometric analysis and knowledge mapping of literature-based discovery (1986–2020)".
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The documents made available present the data set that was processed in Lucas George Wendt's dissertation, presented in 2024 in the Postgraduate Program in Information Science (PPGCIN) of the Federal University of Rio Grande do Sul (UFRGS). The study is entitled: Brazilian Paleontology: a scientometric analysis based on the Lattes Curriculum. The abstract is as follows. This research sought to carry out a scientometric analysis of Paleontology in Brazil based on data collected in the Lattes Curriculum. The general objective of this dissertation is to analyze the scientific field of Paleontology diachronically and through a scientometric study - which will be explained based on the personal information of the researchers collected in their profiles and the scientific literature produced and registered in the Lattes Curriculum of the Lattes Platform. The literature review presented the concepts of Information Science, the area that, in this study, seeks to understand Paleontology through its research instruments; Scientific Communication, the main subject analyzed in this study; Metric Information Studies, the theoretical-methodological framework used in this research; Scientometrics, the theoretical scope used to understand in greater depth the constitution of the field of national Paleontology. Finally, references were also presented that help in the understanding of Paleontology in its national, South American, North American and European contexts. The research used a mixed approach of qualitative and quantitative elements. The data were generated from the CVs of researchers registered on the Lattes Platform, collected using the Brapci Bibliometric Tools tool and analyzed in specific software for metric analysis. To achieve the research objectives, data from 1,465 researcher profiles were analyzed. Regarding the full articles published in journals, 43,333 articles were considered valid. 
Regarding the keywords of the articles, 91,922 keywords were analyzed for word clouds and 84,771 for relationship networks. Of the academic orientations, 1,182 profiles generated 51,400 valid orientations. The aspect of the current employment relationship had 1,256 profiles considered. Regarding academic backgrounds, 1,465 profiles generated 4,556 academic backgrounds analyzed. The main contribution of this study is the realization of an unprecedented mapping of the panorama of Paleontology in Brazil, since there are no other studies that establish the same relationships that this research sought to establish. Regarding the results, based on the data collected and analyzed, the general metric indicators linked to the scientific production associated with Brazilian Paleontology were presented based on the information collected in the Lattes Curriculum; the directions of research in Paleontology that currently constitute this field in Brazil were mapped, as well as their thematic associations with other fields of knowledge; the training of PhD researchers who work with Paleontology or who have their production associated with Paleontology in terms of their academic training was characterized; and where the scientific knowledge in Paleontology or associated with Paleontology is produced was identified. The results of this study are relevant to understanding Brazilian Paleontology, highlighting its national orientation in fossil studies, doctoral training in local institutions and predominant activity in national organizations. These elements are important to consolidate Brazilian paleontological science globally. Regarding interdisciplinary relations, a clear proximity between Paleontology and Geosciences is observed, influenced by the history and current dynamics of the field. The study is available in full at this link: https://lume.ufrgs.br/handle/10183/278682.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data generated by the authors or analyzed during the study, Appendices, and all the generated networks are posted here.
Location of the institution
The location of each institution was established in a 2-step process. First, a custom web-scraping tool was used to find the latitude-longitude data of the institution on Wikipedia. For institutions that were not successfully identified on Wikipedia, the location was obtained using GeoPy (Bing location service).
The following columns are related to location:
LatLonReady: Latitude and longitude data
country: name of the country
continent: name of the continent
cblock: same as country, EU countries as EU28
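The two-step lookup and the cblock column described above can be sketched as follows. This is a minimal illustration, not the authors' tool: the function names `locate` and `country_block`, the stub lookups, and the English country spellings in `EU28` are all assumptions, since the dataset description does not show its code or its exact country labels.

```python
from typing import Callable, Optional, Tuple

LatLon = Tuple[float, float]
Lookup = Callable[[str], Optional[LatLon]]

# The 28 EU member states before 2020. The spellings are an assumption;
# the dataset may label countries differently.
EU28 = frozenset({
    "Austria", "Belgium", "Bulgaria", "Croatia", "Cyprus", "Czech Republic",
    "Denmark", "Estonia", "Finland", "France", "Germany", "Greece", "Hungary",
    "Ireland", "Italy", "Latvia", "Lithuania", "Luxembourg", "Malta",
    "Netherlands", "Poland", "Portugal", "Romania", "Slovakia", "Slovenia",
    "Spain", "Sweden", "United Kingdom",
})


def locate(name: str, primary: Lookup, fallback: Lookup) -> Optional[LatLon]:
    """Try the primary source first; use the fallback only if it finds nothing."""
    result = primary(name)
    if result is None:
        result = fallback(name)
    return result


def country_block(country: str) -> str:
    """Return 'EU28' for EU member states, otherwise the country itself."""
    return "EU28" if country in EU28 else country


# Hypothetical stand-ins for the Wikipedia scraper and the geocoder.
# A real fallback could wrap geopy.geocoders.Bing(api_key=...).geocode(name)
# and return (location.latitude, location.longitude).
wikipedia_stub = {"Example University": (48.1, 11.6)}.get
geocoder_stub = {"Example Institute": (40.4, -3.7)}.get

print(locate("Example University", wikipedia_stub, geocoder_stub))
print(locate("Example Institute", wikipedia_stub, geocoder_stub))
print(country_block("France"), country_block("Brazil"))
```

Passing the two lookups as callables keeps the fallback order explicit and lets either source be swapped out or stubbed for testing.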
Business institution
Using the scraping tool developed for the Wikipedia search, institutions were identified as business units if their Wikipedia page contained information on Industry / HQ / Product.
business: dummy variable
This file documents the raw data and manual codings (as of July 2022) of 46 journals listed in the Scimago Journal & Country Rank for the search criteria "Subject category: Law", "Region/Country: Germany", and "Year: 2021". These data are the basis of the blog post Hamann/Emmelheinz, "Scopus/Scimago: Useless for Studying Legal Research! An Empirical Assessment of Misclassification Rates in a Popular Scientometric Data Source", Legal|Empirics blog 25 July 2022, DOI 10.25527/re.2022.02.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This study examines the evolution of User Experience as a distinct field that emerged from human-computer interaction, analyzing 27,082 records dating back to 1985. Data were collected from the Web of Science Core Collection and examined using the R-tool for statistical analysis and visualization of co-citation networks, and CiteSpace for visualizing citation networks. Our analysis reveals a dynamic body of literature strongly influenced by engineering disciplines, with research progressing together with technological innovation. This development has gradually shifted researchers' attention toward human perception, as they address challenges posed by increasing technological complexity and rising user expectations regarding quality of experience. Moreover, the field demonstrates promising opportunities in healthcare applications and a shift from theoretical frameworks to practical implementations. This study represents the first comprehensive scientometric analysis of User Experience research, providing critical insights into its developmental trajectory across various domains and offering a valuable resource for understanding and advancing User Experience research.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This paper analyzes the patterns of health biotechnology (HBT) publications in six Latin American countries from 2001 to 2015. The countries studied were Argentina, Brazil, Chile, Colombia, Cuba and Mexico. Before our study, there were no data available on HBT development in half of the Latin American countries we studied, i.e., Argentina, Colombia and Chile. Including these countries in a scientometric analysis of HBT provides fuller coverage of HBT development in Latin America. The scientometric study used the Web of Science database to identify health biotechnology publications. The total amount of health biotechnology production in the world during the period studied was about 400,000 papers. A total of 1.2% of these papers were authored by the six Latin American countries in this study. The results show a significant growth in health biotechnology publications in Latin America despite some of the countries having social and political instability, fluctuations in their gross domestic expenditure in research and development or a trade embargo that limits opportunities for scientific development. The growth in the field in some of the Latin American countries studied was larger than the growth in most industrialized nations. Still, the visibility of the Latin American research (measured in the number of citations) did not reach the world average, with the exception of Colombia. The main producers of health biotechnology papers in Latin America were universities, except in Cuba, where governmental institutions were the most frequent producers. The countries studied were active in international research collaboration, with Colombia being the most active (64% of papers co-authored internationally), whereas Brazil was the least active (35% of papers). Still, domestic collaboration was even more prevalent, with Chile being the most active in such collaboration (85% of papers co-authored domestically) and Argentina the least active (49% of papers).
We conclude that the Latin American countries studied are increasing their health biotechnology publishing. This strategy could contribute to the development of innovations that may solve local health problems in the region.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background: Asthma is one of the most common chronic diseases in children globally. In recent decades, advances have been made in understanding the mechanism, diagnosis, treatment and management for childhood asthma, but few studies have explored its knowledge structure and future interests comprehensively.
Objective: This scientometric study aims to understand the research status and emerging trends of childhood asthma.
Methods: CiteSpace (version 5.8.R3) was used to demonstrate national and institutional collaborations in childhood asthma, analyze research subjects and journal distribution, review research keywords and their clusters, as well as detect research bursts.
Results: A total of 14,340 publications related to childhood asthma were extracted from Web of Science (core database) during January 2011 to December 2021. The results showed that academic activities of childhood asthma had increased steadily in the last decade. Most of the research was conducted by developed countries while China, as a developing country, was also actively engaged in this field. In addition to subjects of allergy and immunology, both public health aspects and ecological environmental impacts on the disease were emphasized recently in this research field. Keywords clustering analysis indicated that research on asthma management and atopy was constantly updated and became the two major research focuses recently, as a significant shift in research hotspots from etiology and diagnosis to atopic march and asthma management was identified. Subgroup analysis for childhood asthma management and atopy suggested that caregiver- or physician-based education and interventions were emerging directions for asthma management, and that asthma should be carefully studied in the context of atopy, together with other allergic diseases.
Conclusions: This study presented a comprehensive and systematic overview of the research status of childhood asthma, provided clues to future research directions, and highlighted two significant research trends of asthma management and atopy in this field.
No description was included in this Dataset collected from the OSF
This dataset contains the data used to complete the following article, which is under peer review. Reference:
Cabezas, A.; Milanés, Y.; Alba, R.; Delgado, A.M. (2023). The need to develop tailored tools for improving the quality of thematic bibliometric analyses: Evidence from papers published in Sustainability and Scientometrics. (Article under peer review) Institutions: Spain (Universidad Internacional de La Rioja, Universidad Pablo de Olavide, Hospital Universitario Virgen de las Nieves)
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
As scientists worldwide search for answers to the overwhelmingly unknown behind the deadly pandemic, the literature concerning COVID-19 has been growing exponentially. Keeping abreast of the body of literature at such a rapidly advancing pace poses significant challenges not only to active researchers but also to society as a whole. Although numerous data resources have been made openly available, the analytic and synthetic process that is essential in effectively navigating through the vast amount of information with heightened levels of uncertainty remains a significant bottleneck. We introduce a generic method that facilitates the data collection and sense-making process when dealing with a rapidly growing landscape of a research domain such as COVID-19 at multiple levels of granularity. The method integrates the analysis of structural and temporal patterns in scholarly publications with the delineation of thematic concentrations and the types of uncertainties that may offer additional insights into the complexity of the unknown. We demonstrate the application of the method in a study of the COVID-19 literature.
Description
unarXive is a scholarly data set containing publications' full-text, annotated in-text citations, and a citation network.
The data is generated from all LaTeX sources on arXiv and therefore of higher quality than data generated from PDF files.
Typical use cases are
Citation recommendation
Citation context analysis
Bibliographic analyses
Reference string parsing
This version (v3) of our data set is based on all arXiv publications until 2020-07-31 and on the Microsoft Academic Graph as of 2020-08-18. As an additional contribution, we included a table with the publication date and the scientific discipline for each paper for easier filtering.
Note: This Zenodo record is an old version of unarXive. You can find the most recent version at https://zenodo.org/record/7752754 and https://zenodo.org/record/7752615
Access
Download sample
To download the whole data set send an access request and note the following:
Note: this Zenodo record is a "full" version of unarXive, which was generated from all of arXiv.org including non-permissively licensed papers. Make sure that your use of the data is compliant with the paper's licensing terms.¹
¹ For information on papers' licenses use arXiv's bulk metadata access.
The code used for generating the data set is publicly available.
Usage examples for our data set are provided on GitHub.
Citing
This initial version of unarXive is described in the following journal article.
Tarek Saier, Michael Färber: "unarXive: A Large Scholarly Data Set with Publications' Full-Text, Annotated In-Text Citations, and Links to Metadata", Scientometrics, 2020. [link to an author copy]
The updated version is described in the following conference paper.
Tarek Saier, Michael Färber: "unarXive 2022: All arXiv Publications Pre-Processed for NLP, Including Structured Full-Text and Citation Network", JCDL 2023. [link to an author copy]
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset was used to produce part of the results of a scientific article whose abstract is:
The relationship between international collaboration and scientific impact is studied in the context of South American universities. This study aims to comprehensively analyze the strength of this relationship using nonparametric statistical methods. The records are the 244,300 papers published in journals indexed in Scopus (2011-2020) by researchers affiliated to 10 South American public universities and extracted with SciVal support. There is a marked trend of collaborative work, since 93% of publications were collaborative at institutional, national or international level, with a higher percentage of international collaboration. A refined analysis of the geographic collaboration of publications in Q1 journals further evidences the frequency of international collaboration. In the top 4 collaborating partner institutions for each university, the presence of the Centre National de la Recherche Scientifique of France (CNRS) is observed, followed by the National Council for Scientific and Technical Research of Argentina (Conicet). It is proven that there is a statistically significant relationship (p < .01) in each of the 10 universities between collaboration (number of countries) and normalized impact (FWCI). The results confirmed the hypothesis of this study and the authors provide practical recommendations for science policy makers and researchers, including the promotion of strategic collaboration between different institutional sectors of society to increase the impact of publications.
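The abstract names nonparametric methods but not the specific test. As one plausible illustration only, a Spearman rank correlation between the number of collaborating countries and FWCI could be computed as sketched below. The choice of Spearman, the function names, and the toy numbers are assumptions and are not taken from the article or its data.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Invented toy values, for illustration only (not from the dataset).
countries_per_paper = [1, 1, 2, 3, 5, 8]
fwci_per_paper = [0.4, 0.7, 0.9, 1.1, 1.0, 2.3]
print(round(spearman_rho(countries_per_paper, fwci_per_paper), 3))
```

A rank-based coefficient only asks whether impact tends to rise with the number of countries, without assuming a linear relationship. A significance test such as the p < .01 result reported above would need an additional step that this sketch omits.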
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset on the scientometrics of teaching in the Higher Education Academy (HEA) via group learning (Part 2) contains supplementary data, including keywords, word clouds and data retrieved from the SCOPUS database. The dataset was used for a paper applying scientometrics, which is based on bibliometric analysis. The data cover parameters such as publication authors and publications' geographical location by country, used to investigate the research pattern. See the full paper in: Amaechi, C.V.; Amaechi, E.C.; Onumonu, U.P.; Kgosiemang, I.M. Systematic review and Annotated Bibliography on Teaching in Higher Education Academy via Group Learning to adapt with COVID-19. Education Sciences 2022, 16, under review.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset includes data used in the scientometric study on Additive Manufacturing for lattice structure. It contains supplementary data on Additive Manufacturing (AM), including keywords, word clouds and data retrieved from the SCOPUS and Web of Science databases. The dataset was used for a paper applying scientometrics, which is based on bibliometric analysis. VOSviewer was used for establishing research patterns, visualising maps and identifying transcendental issues on Additive Manufacturing for lattice structure. Some data on Additive Manufacturing were also included, as the data were subjected to a scientometric study which looked at different parameters such as publication authors and publications' geographical location by country.
This is a dataset used in and produced by research described in the paper titled "Crossref as a source of scientometric data for social & human sciences".
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
unarXive is a scholarly data set containing publications' structured full-text, annotated in-text citations, linked non-text content (mathematical notation, figure/table captions) and a citation network.
The data is generated from all LaTeX sources on arXiv and therefore of higher quality than data generated from PDF files.
Typical uses are
Training of ML models (citation recommendation, summarization, LLMs)
Citation context analysis
Bibliographic analyses
Access
Download sample
Regarding the full data set, please note the following:
Note: this Zenodo record is the "open subset" of unarXive, which contains all permissively licensed papers from arXiv.org. You can find the full version here. The code used for generating the data set is publicly available.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We use journal articles published in Scientometrics in 2016-2020 as the data source. By analyzing the data set usage records of scientometrics research, we list the usage frequency ranking of each dataset on patents, so as to provide a reference for the selection of data sets for scientometrics research.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Number of publications according to country
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains impact metrics and indicators for a set of publications that are related to the COVID-19 infectious disease and the coronavirus that causes it. It is based on:
The CORD-19 dataset released by the team of Semantic Scholar [1] and
The curated data provided by the LitCovid hub [2].
These data have been cleaned and integrated with data from COVID-19-TweetIDs and from other sources (e.g., PMC). The result was a dataset of 628,506 unique articles along with relevant metadata (e.g., the underlying citation network). We utilized this dataset to produce, for each article, the values of the following impact measures:
Influence: Citation-based measure reflecting the total impact of an article. This is based on the PageRank [3] network analysis method. In the context of citation networks, it estimates the importance of each article based on its centrality in the whole network. This measure was calculated using the PaperRanking (https://github.com/diwis/PaperRanking) library [4].
Influence_alt: Citation-based measure reflecting the total impact of an article. This is the Citation Count of each article, calculated based on the citation network between the articles contained in the BIP4COVID19 dataset.
Popularity: Citation-based measure reflecting the current impact of an article. This is based on the AttRank [5] citation network analysis method. Methods like PageRank are biased against recently published articles (new articles need time to receive their first citations). AttRank alleviates this problem by incorporating an attention-based mechanism, akin to a time-restricted version of preferential attachment, to explicitly capture a researcher's preference to read papers which received a lot of attention recently. This is why it is more suitable to capture the current "hype" of an article.
Popularity alternative: An alternative citation-based measure reflecting the current impact of an article (this was the basic popularity measure provided by BIP4COVID19 until version 26). This is based on the RAM [6] citation network analysis method. Methods like PageRank are biased against recently published articles (new articles need time to receive their first citations). RAM alleviates this problem using an approach known as "time-awareness". This is why it is more suitable to capture the current "hype" of an article. This measure was calculated using the PaperRanking (https://github.com/diwis/PaperRanking) library [4].
Social Media Attention: The number of tweets related to this article. Relevant data were collected from the COVID-19-TweetIDs dataset. In this version, tweets between 23/6/22-29/6/22 have been considered from the previous dataset.
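To make the difference between the two influence measures concrete, here is a minimal sketch that computes a PageRank-style score and a plain citation count on a toy citation network. It is a plain power iteration written for illustration, not the PaperRanking library's implementation. The damping factor, the iteration count, and the four-paper network are all assumptions.

```python
def pagerank(citations, damping=0.85, iterations=100):
    """PageRank by power iteration.

    citations maps each paper to the list of papers it cites.
    """
    nodes = set(citations)
    for targets in citations.values():
        nodes.update(targets)
    nodes = sorted(nodes)
    n = len(nodes)
    score = {p: 1.0 / n for p in nodes}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in nodes}
        for p in nodes:
            targets = citations.get(p, [])
            if targets:
                # A paper passes its score on to the papers it cites.
                share = damping * score[p] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # A paper that cites nothing spreads its score uniformly.
                share = damping * score[p] / n
                for t in nodes:
                    new[t] += share
        score = new
    return score


def citation_counts(citations):
    """Number of citations each paper receives within the network."""
    counts = {}
    for paper, targets in citations.items():
        counts.setdefault(paper, 0)
        for t in targets:
            counts[t] = counts.get(t, 0) + 1
    return counts


# Invented toy network: B, C and D cite A; C also cites B.
toy = {"A": [], "B": ["A"], "C": ["A", "B"], "D": ["A"]}
scores = pagerank(toy)
print(max(scores, key=scores.get))
print(citation_counts(toy))
```

In this toy network both measures rank paper A first. On a real citation network the two can diverge, because PageRank weights a citation by the importance of the citing article while the citation count treats all citations equally.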
We provide five CSV files, all containing the same information, however each having its entries ordered by a different impact measure. All CSV files are tab separated and have the same columns (PubMed_id, PMC_id, DOI, influence_score, popularity_alt_score, popularity score, influence_alt score, tweets count).
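A tab-separated file like those described above could be read with Python's standard `csv` module, as in the sketch below. It assumes the files start with a header row and uses only the `DOI` and `influence_score` column names from the list above. The in-memory sample rows and their DOIs are placeholders invented for illustration, not records from the dataset.

```python
import csv
import io


def load_rows(handle):
    """Read a tab-separated file with a header row into a list of dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))


def top_by(rows, column, k=3):
    """Return the k rows with the largest numeric value in the given column."""
    return sorted(rows, key=lambda row: float(row[column]), reverse=True)[:k]


# Placeholder content standing in for one of the five files (invented values).
SAMPLE = (
    "DOI\tinfluence_score\n"
    "10.0000/example.1\t0.10\n"
    "10.0000/example.2\t0.75\n"
    "10.0000/example.3\t0.30\n"
)

rows = load_rows(io.StringIO(SAMPLE))
for row in top_by(rows, "influence_score", k=2):
    print(row["DOI"], row["influence_score"])
```

For a real file, `io.StringIO(SAMPLE)` would be replaced by `open(path, newline="", encoding="utf-8")`. Because each of the five files is already ordered by one measure, re-sorting is only needed to rank by a different column.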
The work is based on the following publications:
1. COVID-19 Open Research Dataset (CORD-19). 2020. Version 2023-01-10. Retrieved from https://pages.semanticscholar.org/coronavirus-research. Accessed 2023-01-10. doi:10.5281/zenodo.3715506
2. Chen Q, Allot A, & Lu Z. (2020) Keep up with the latest coronavirus research, Nature 579:193 (version 2023-01-10)
3. L. Page, S. Brin, R. Motwani and T. Winograd. 1999. The PageRank Citation Ranking: Bringing Order to the Web. Technical Report. Stanford InfoLab.
4. I. Kanellos, T. Vergoulis, D. Sacharidis, T. Dalamagas, Y. Vassiliou: Impact-Based Ranking of Scientific Publications: A Survey and Experimental Evaluation. TKDE 2019
5. I. Kanellos, T. Vergoulis, D. Sacharidis, T. Dalamagas, Y. Vassiliou: Ranking Papers by their Short-Term Scientific Impact. CoRR abs/2006.00951 (2020)
6. Rumi Ghosh, Tsung-Ting Kuo, Chun-Nan Hsu, Shou-De Lin, and Kristina Lerman. 2011. Time-Aware Ranking in Dynamic Citation Networks. In Data Mining Workshops (ICDMW). 373–380
A Web user interface that uses these data to facilitate the COVID-19 literature exploration, can be found here. More details in our peer-reviewed publication here (also here there is an outdated preprint version).
Funding: We acknowledge support of this work by the project "Moving from Big Data Management to Data Science" (MIS 5002437/3) which is implemented under the Action "Reinforcement of the Research and Innovation Infrastructure", funded by the Operational Programme "Competitiveness, Entrepreneurship and Innovation" (NSRF 2014-2020) and co-financed by Greece and the European Union (European Regional Development Fund).
Terms of use: These data are provided "as is", without any warranties of any kind. The data are provided under the Creative Commons Attribution 4.0 International license.