License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Methods for the construction of the corpus of citation contexts
We used Semantic Scholar (https://www.semanticscholar.org), an academic database covering over 200 million scholarly documents drawn from diverse sources including publishers, data providers, and web crawlers. Using the Semantic Scholar identifier of Fanelli's 2009 publication (d9db67acc223c9bd9b8c1d4969dc105409c6dfef), we queried the Semantic Scholar API (https://www.semanticscholar.org/product/api) to retrieve the available citation contexts. Citation contexts were extracted from the "contexts" field of the JSON response pages (see the technical specification: https://api.semanticscholar.org/api-docs/#tag/Paper-Data/operation/get_graph_get_paper_citations).
The query looks like this: https://api.semanticscholar.org/graph/v1/paper/d9db67acc223c9bd9b8c1d4969dc105409c6dfef/citations?fields=title,year,publicationVenue,externalIds,contexts,intents,isInfluential,abstract&offset=1&limit=100
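As an illustration, the paginated retrieval can be scripted as follows. This is a minimal sketch using the Python requests library, not the exact script used to build the corpus; the "offset"/"next" pagination keys and the "contexts" field follow the Graph API documentation linked above.

```python
import requests

BASE = "https://api.semanticscholar.org/graph/v1/paper"
PAPER_ID = "d9db67acc223c9bd9b8c1d4969dc105409c6dfef"  # Fanelli (2009)

def fetch_citation_contexts(paper_id: str, page_size: int = 100) -> list[str]:
    """Page through /citations and collect every snippet in the 'contexts' field."""
    contexts, offset = [], 0
    while True:
        resp = requests.get(
            f"{BASE}/{paper_id}/citations",
            params={"fields": "title,year,contexts", "offset": offset, "limit": page_size},
        )
        resp.raise_for_status()
        payload = resp.json()
        for record in payload.get("data", []):
            contexts.extend(record.get("contexts") or [])
        if "next" not in payload:  # last page reached
            return contexts
        offset = payload["next"]

print(len(fetch_citation_contexts(PAPER_ID)))
```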
The broad coverage of Semantic Scholar does not mean that citation contexts are always available: the API returned citation contexts for only 48% of the 1,452 documents citing the paper. To recover more, we identified open access papers among the remaining 52% of citing papers, retrieved their PDF locations and downloaded the files. For this we used the Unpaywall API (https://unpaywall.org/products/api), a database queried by DOI that returns open access information for a document. A query looks like: https://api.unpaywall.org/v2/10.1162/qss_a_00220?email=mail@example.com
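A sketch of the Unpaywall lookup (again illustrative, not the script we ran; "best_oa_location" and "url_for_pdf" are fields documented in the Unpaywall API):

```python
import requests

def best_pdf_url(doi: str, email: str = "mail@example.com") -> str | None:
    """Return the best open access PDF URL Unpaywall knows for a DOI, if any."""
    resp = requests.get(f"https://api.unpaywall.org/v2/{doi}", params={"email": email})
    resp.raise_for_status()
    location = resp.json().get("best_oa_location") or {}
    return location.get("url_for_pdf")

print(best_pdf_url("10.1162/qss_a_00220"))
```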
We downloaded 266 PDF files and converted them to text format using an online bulk PDF-to-text converter (https://overbits.herokuapp.com/pdftotext/). These files were then processed with TXM (https://txm.gitpages.huma-num.fr/textometrie/en/Presentation/), a specialized textual analysis tool. We used its concordancer with "Fanelli" as the pivot term and checked that the cited reference was the correct one (the 2009 paper in PLoS ONE). After manual cleaning, we appended the resulting citation contexts to the previous corpus.
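The concordancer step can be approximated outside TXM with a few lines of Python; this sketch is a crude stand-in, assuming one UTF-8 .txt file per paper in a folder named converted_pdfs (a hypothetical name):

```python
import pathlib
import re

def concordance(folder: str, pivot: str = "Fanelli", window: int = 200):
    """Yield (file name, snippet) for every occurrence of the pivot term,
    with `window` characters of context on each side."""
    for path in pathlib.Path(folder).glob("*.txt"):
        text = path.read_text(encoding="utf-8", errors="ignore")
        for match in re.finditer(re.escape(pivot), text):
            start = max(0, match.start() - window)
            snippet = text[start:match.end() + window].replace("\n", " ")
            yield path.name, snippet

for name, snippet in concordance("converted_pdfs"):
    print(name, "::", snippet)
```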
Through this methodology, we ultimately identified 824 citation contexts from 784 documents, i.e. 54% of all documents citing Fanelli's 2009 paper: 48% of contexts were retrieved from Semantic Scholar and an additional 6% were obtained through semi-manual extraction from open access documents. Of these contexts, 87 were excluded from the analysis for reasons including a context too short to support a conclusion, a language other than English or French (the shared languages of the authors of this review), and duplicate documents (e.g. preprints), leaving us with 737 contexts. These were first classified manually into two categories: those mentioning the 2% figure and those that did not. Contexts in the first category were then further classified manually according to whether the figure was appropriately attributed to researchers' self-reports or misleadingly suggested that the 2% applied to research outputs.
File structure
The file is an .xlsx file composed of three sheets. The first sheet, "citcontext (RAW DATA)", includes all information retrieved through the process described above. The second sheet, "Excluded from analysis", lists the 87 records excluded from analysis with brief descriptions of the reasons for exclusion. The 737 contexts analysed are shown in the third sheet, "Analysis of citcontext", together with the classifications described above.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Dataset for our systematic mapping study on API Deprecation:
L. Bonorden and M. Riebisch, "API Deprecation: A Systematic Mapping Study," in 48th Euromicro Conference on Software Engineering and Advanced Applications (SEAA 2022), Maspalomas, Spain, 2022.
We provide the decisive criteria for excluded studies and the complete classification for included studies.
Furthermore, we provide a list of related works, i.e., references and citations identified through snowballing on the included studies. The IDs in the data set refer to entries in the Semantic Scholar Academic Graph, which can be accessed via semanticscholar.org/paper/{ID}.
(The list of related works has been added in version 2 of this data set.)
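An ID from the data set can also be resolved programmatically through the Semantic Scholar Graph API. A minimal Python sketch (the field names follow the public API; this helper is illustrative and not part of the data set):

```python
import requests

def s2_metadata(paper_id: str) -> dict:
    """Fetch basic metadata for a Semantic Scholar paper ID."""
    resp = requests.get(
        f"https://api.semanticscholar.org/graph/v1/paper/{paper_id}",
        params={"fields": "title,year,externalIds"},
    )
    resp.raise_for_status()
    return resp.json()
```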
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Data collection
This dataset contains information on the eprints posted on arXiv from its launch in 1991 until the end of 2019 (1,589,006 unique eprints), together with their citation data and the associated impact metrics. Here, eprints include preprints, conference proceedings, book chapters, data sets and commentaries, i.e. any type of electronic material posted on arXiv.
The content and metadata of the arXiv eprints were retrieved from the arXiv API (https://arxiv.org/help/api/) as of 21st January 2020; the metadata included each eprint's title, authors, abstract, subject categories and arXiv ID (arXiv's original eprint identifier). The associated citation data were derived from the Semantic Scholar API (https://api.semanticscholar.org/) between 24th January 2020 and 7th February 2020, covering citations into and out of the arXiv eprints and their published versions (if applicable). Whether an eprint has been published in a journal or through other means is assumed to be inferrable, albeit indirectly, from the status of its digital object identifier (DOI) assignment. It is also assumed that if an arXiv eprint received c_pre and c_pub citations up to the data retrieval date (7th February 2020) before and after being assigned a DOI, respectively, then its citation count is recorded in the Semantic Scholar dataset as c_pre + c_pub. Both the arXiv API and the Semantic Scholar datasets contained the arXiv ID as metadata, which served as the key variable for merging the two datasets.
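A minimal sketch of the merge step, assuming the two retrievals have been flattened into tables sharing an arxiv_id column (the file names are hypothetical; the article's actual pipeline is described in Okamura, to appear):

```python
import pandas as pd

arxiv = pd.read_csv("arxiv_metadata.csv")    # title, abstract, category, arxiv_id, ...
citations = pd.read_csv("s2_citations.csv")  # citation counts keyed by arxiv_id

# arXiv IDs are unique on both sides, so the merge should be one-to-one.
merged = arxiv.merge(citations, on="arxiv_id", how="left", validate="one_to_one")
```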
The classification of research disciplines follows the scheme described on the arXiv.org website (https://arxiv.org/help/stats/2020_by_area/), where the arXiv subject categories are aggregated into several disciplines. We restrict our attention to the following six, which collectively accounted for 98% of all eprints: Astrophysics ('astro-ph'), Computer Science ('comp-sci'), Condensed Matter Physics ('cond-mat'), High Energy Physics ('hep'), Mathematics ('math') and Other Physics ('oth-phys'). Eprints tagged with multiple arXiv disciplines were counted once per discipline; because of this overlap, the dataset contains a cumulative total of 2,011,216 eprints.
Some general statistics and visualisations per research discipline are provided in the original article (Okamura, to appear), where the validity and limitations associated with the dataset are also discussed.
Description of columns (variables)
arxiv_id : arXiv ID
category : Research discipline
pre_year : Year of posting v1 on arXiv
pub_year : Year of DOI acquisition
c_tot : No. of citations acquired during 1991–2019
c_pre : No. of citations acquired before and including the year of DOI acquisition
c_pub : No. of citations acquired after the year of DOI acquisition
c_yyyy : No. of citations acquired in the year yyyy (yyyy = 1991, …, 2019)
gamma : The quantitatively-and-temporally normalised citation index
gamma_star : The quantitatively-and-temporally standardised citation index
Note: The definition of the quantitatively-and-temporally normalised citation index (γ; ‘gamma’) and that of the standardised citation index (γ*; ‘gamma_star’) are provided in the original article (Okamura, to appear). Both indices can be used to compare the citational impact of papers/eprints published in different research disciplines at different times.
Data files
A comma-separated values file (‘arXiv_impact.csv’) and a Stata file (‘arXiv_impact.dta’) are provided, both containing the same information.
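A short usage example with the CSV file, assuming the yearly columns are materialised as c_1991 … c_2019 as listed above (the consistency check is a plausible sanity test, not a documented guarantee):

```python
import pandas as pd

df = pd.read_csv("arXiv_impact.csv")

# If the yearly counts partition the total, c_tot should equal their sum.
year_cols = [f"c_{y}" for y in range(1991, 2020)]
print((df["c_tot"] == df[year_cols].sum(axis=1)).mean())  # share of rows that match

# Median normalised citation index per research discipline.
print(df.groupby("category")["gamma"].median())
```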
License: unknown (https://choosealicense.com/licenses/unknown/)
This is a dataset for classifying citation intents in academic papers. The citation intent label of each JSON object is given by the 'label' key, and the citation context by the 'string' key (the enclosing section is named in 'sectionName'). Example:
{
  'string': 'In chacma baboons, male-infant relationships can be linked to both formation of friendships and paternity success [30,31].',
  'sectionName': 'Introduction',
  'label': 'background',
  'citingPaperId': '7a6b2d4b405439',
  'citedPaperId': '9d1abadc55b5e0',
  ...
}
You may obtain the full information about a paper using the provided paper IDs with the Semantic Scholar API (https://api.semanticscholar.org/). The labels are: Method, Background, Result.
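A minimal sketch of reading the data set and resolving a cited paper through the Graph API (the file name train.jsonl is hypothetical; the API call uses the public /graph/v1/paper endpoint):

```python
import json
import requests

with open("train.jsonl") as f:  # hypothetical file name
    examples = [json.loads(line) for line in f]

example = examples[0]
print(example["label"], "|", example["string"])

# Resolve the cited paper's title and year via the Semantic Scholar Graph API.
resp = requests.get(
    f"https://api.semanticscholar.org/graph/v1/paper/{example['citedPaperId']}",
    params={"fields": "title,year"},
)
resp.raise_for_status()
print(resp.json())
```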
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
This data set contains frequency counts of target words in 175 million academic abstracts published in all fields of knowledge. We quantify the prevalence of words denoting prejudice against ethnicity, gender, sexual orientation, gender identity, minority religious sentiment, age, body weight and disability in SSORC abstracts over the period 1970-2020. We then examine the relationship between the prevalence of such terms in the academic literature and their concomitant prevalence in news media content. We also analyse the temporal dynamics of an additional set of terms associated with social justice discourse in both the scholarly literature and news media content. A few additional words not denoting prejudice are also included, as they are used in the manuscript for illustration purposes.

The list of academic abstracts analysed in this work was taken from the Semantic Scholar Open Research Corpus (SSORC). As of 2020, the corpus contains over 175 million academic abstracts, and associated metadata, from all fields of knowledge. The raw data is provided by Semantic Scholar in accessible JSON format. The textual content included in our analysis is restricted to the articles' titles and abstracts and does not include other elements such as the main body of text or the references section. We therefore use frequency counts derived from titles and abstracts as a proxy for word prevalence in those articles; this proxy was needed because the SSORC corpus does not provide the full text body of the indexed articles.

Targeted textual content was located in the JSON data and sorted by year to facilitate chronological analysis. Tokens were lowercased prior to estimating frequency counts. The yearly relative frequency of a target word or n-gram in the SSORC corpus was estimated by dividing the number of occurrences of the target word/n-gram in all scholarly articles of a given year by the total number of words in all articles of that year. This method of estimating word frequencies accounts for the variable volume of total scientific output over time, and has previously been shown to accurately capture the temporal dynamics of historical events and social trends in news media corpora.

A small percentage of scholarly articles in the SSORC corpus may contain incorrect or missing data; for earlier years, the abstract is sometimes missing and only the title is available. As a result, the total and target word counts for a small subset of academic abstracts may be imprecise. In an analysis of 175 million abstracts, manually checking the accuracy of the frequency counts for every abstract is unfeasible, and 100% accuracy at capturing abstracts' content is elusive given a small number of erroneous outlier cases in the raw data. Overall, however, we are confident that our frequency metrics are representative of word prevalence in academic content, as illustrated by Figure 2 in the main manuscript, which shows the chronological prevalence in the SSORC corpus of several terms associated with different disciplines of scientific/academic knowledge.

Factor analysis of the frequency count time series was carried out only after Bartlett's test of sphericity and the Kaiser-Meyer-Olkin (KMO) test confirmed the suitability of the data for factor analysis.
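Before turning to the factor analysis, here is a minimal sketch of the yearly relative-frequency estimation described above (illustrative only; it uses a plain whitespace tokenizer, whereas the published scripts are in analysisScripts.rar):

```python
from collections import Counter

def yearly_relative_frequency(docs, target):
    """docs: iterable of (year, text) pairs, e.g. title + abstract per article.
    Returns the target word's frequency per year, normalised by the total
    number of tokens published that year."""
    hits, totals = Counter(), Counter()
    for year, text in docs:
        tokens = text.lower().split()
        totals[year] += len(tokens)
        hits[year] += tokens.count(target.lower())
    return {year: hits[year] / totals[year] for year in totals if totals[year]}
```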
A single factor was extracted from the frequency count time series of prejudice-denoting terms in each corpus (academic abstracts and news media content). The same procedure was applied to the terms denoting social justice discourse. A factor loading cutoff of 0.5 was used to ascribe terms to a factor. Cronbach's alphas, computed to check that the resulting factors were coherent, were extremely high (>0.95). A minimal sketch of this factor-analysis procedure is given at the end of this entry.

The textual content of the news and opinion articles from the outlets listed in Figure 5 of the main manuscript is available in the outlets' online domains and/or public cache repositories such as Google cache (https://webcache.googleusercontent.com), the Internet Wayback Machine (https://archive.org/web/web.php), and Common Crawl (https://commoncrawl.org). We used word frequency counts derived from the original sources. The textual content included in our analysis is restricted to the headlines and main body of the articles and does not include other elements such as figure captions. Targeted textual content was located in the raw HTML using outlet-specific XPath expressions. Tokens were lowercased prior to estimating frequency counts. To prevent outlets with sparse text content in a year from distorting the aggregate frequency counts, we only include an outlet's frequency counts for years with at least 1 million words of article content from that outlet. The yearly frequency of a target word in an outlet was estimated by dividing the total number of occurrences of the target word in all articles of a given year by the number of all words in all articles of that year; this accounts for the variable volume of total article output over time.

The compressed files in this data set are:
- analysisScripts.rar contains the analysis scripts used in the main manuscript and the raw data metrics.
- scholarlyArticlesContainingTargetWords.rar contains the IDs of each analysed abstract in the SSORC corpus and the counts of target words and total words for each scholarly article.
- targetWordsInMediaArticlesCounts.rar contains the counts of target words in news outlet articles as well as the total word counts of those articles.

In a small percentage of news articles, the outlet-specific XPath expressions can fail to capture the article content properly, owing to the heterogeneity of HTML elements and CSS styling with which article text is arranged in the outlets' online domains. As a result, the total and target word counts for a small subset of articles may be imprecise. In an analysis of millions of news articles we cannot manually check the correctness of the frequency counts for every article, and 100% accuracy at capturing article content is elusive due to a small number of hard-to-detect boundary cases such as incorrect HTML markup. Overall, however, we are confident that our frequency metrics are representative of word prevalence in print news media content (see Rozado, Al-Gharbi, and Halberstadt, "Prevalence of Prejudice-Denoting Words in News Media Discourse" for supporting evidence).

31/08/2022 Update: There is a new way to download the Semantic Scholar Open Research Corpus (see https://github.com/allenai/s2orc). This updated version states that the corpus contains 136M+ paper nodes.
However, when I downloaded a previous version of the corpus in 2021 from http://s2-public-api-prod.us-west-2.elasticbeanstalk.com/corpus/download/ I counted 175M unique identifiers. The URL of the previous version is no longer active, but it has been cached by the Internet Archive at https://web.archive.org/web/20201030131959/http://s2-public-api-prod.us-west-2.elasticbeanstalk.com/corpus/download/. I have not had time to investigate the specific reason for the mismatch, but perhaps the newer version of the corpus has cleaned out many noisy entries from the previous version, which often contained records with missing abstracts; filtering out entries in low-prevalence languages other than English might be another reason. In any case, Figure 2 of the main manuscript of this work (at https://www.nas.org/academic-questions/35/2/themes-in-academic-literature-prejudice-and-social-justice) should provide support for the validity of the frequency counts.
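The factor-analysis procedure referenced above can be sketched with the factor_analyzer package (illustrative; the input CSV name is hypothetical and the real scripts are in analysisScripts.rar):

```python
import pandas as pd
from factor_analyzer import FactorAnalyzer
from factor_analyzer.factor_analyzer import (
    calculate_bartlett_sphericity,
    calculate_kmo,
)

# Rows: years; columns: per-term relative frequencies (hypothetical file).
series = pd.read_csv("prejudice_term_frequencies.csv", index_col="year")

# Suitability checks before factoring.
_, bartlett_p = calculate_bartlett_sphericity(series)
_, kmo_total = calculate_kmo(series)
print(f"Bartlett p={bartlett_p:.3g}, KMO={kmo_total:.2f}")

# Extract a single factor and keep terms with |loading| >= 0.5.
fa = FactorAnalyzer(n_factors=1, rotation=None)
fa.fit(series)
loadings = pd.Series(fa.loadings_[:, 0], index=series.columns)
print(loadings[loadings.abs() >= 0.5])
```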
License: Open Data Commons Attribution License (ODC-By) v1.0, https://www.opendatacommons.org/licenses/by/1.0/
The Scientific Literature Comparison Table (SLCT) dataset was collected using the arXiv and Semantic Scholar APIs. It underwent a series of processing steps, including manual inspection and editing.
The processing steps are summarized as follows:
1) Downloading survey papers' LaTeX files using the arXiv API (a download sketch follows this list).
2) Preprocessing LaTeX files to HTML format.
3) Extracting tables from the HTML files.
4) Creating a Golden Table as a reference.
5) Generating descriptions for column headers.
6) Acquiring citation data.
7) Finalizing the dataset.
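As referenced in step 1, here is a minimal sketch of the LaTeX download. It uses arXiv's e-print endpoint for source bundles rather than the metadata API; most bundles are (possibly gzipped) tars of .tex files, though some eprints ship a single gzipped file, which this sketch does not handle.

```python
import io
import tarfile
import requests

def download_latex_source(arxiv_id: str, out_dir: str = "sources") -> None:
    """Fetch the source bundle of an arXiv eprint and unpack its .tex files."""
    resp = requests.get(f"https://arxiv.org/e-print/{arxiv_id}")
    resp.raise_for_status()
    # mode="r:*" auto-detects gzip/bzip2 compression of the tar archive.
    with tarfile.open(fileobj=io.BytesIO(resp.content), mode="r:*") as tar:
        tar.extractall(out_dir)

download_latex_source("2103.00001")  # hypothetical eprint ID
```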
License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
Dataset to support the PATHOS project case study analysis (RCAAP Portugal). Data sources: 1) OpenAIRE Research Graph (August 2024) API; 2) Semantic Scholar (April 2024); 3) ROR.org (October 2024); 4) PATSTAT (Spring 2024); 5) ORBIS. Preprocessing steps: 1) filtered for articles from 2015 to present; 2) geographical coverage: RCAAP-affiliated repositories plus authors with affiliations in Portugal; 3) final collection: 466,519 publications.