The Semantic Scholar corpus (S2) is composed of titles from scientific papers published in machine learning conferences and journals from 1985 to 2017, split by year (33 timesteps).
Image Source: http://s2-public-api-prod.us-west-2.elasticbeanstalk.com/corpus/
This repository provides full text and metadata for the ACL Anthology collection (80k articles/posters as of September 2022), including the PDF files and GROBID extractions of those PDFs.
How is this different from what the ACL Anthology provides and what already exists?
We provide PDFs, full text, references, and other details extracted by GROBID from the PDFs, whereas the ACL Anthology only provides abstracts. A similar corpus, the ACL Anthology Network, exists, but it is showing its age with just 23k papers as of December 2016.
The goal is to keep this corpus updated and provide a comprehensive repository of the full ACL collection.
This repository provides data for 80,013 ACL articles/posters:

- 📖 All PDFs in the ACL Anthology: size 45G, download here
- 🎓 All bib files in the ACL Anthology, with abstracts: size 172M, download here
- 🏷️ Raw GROBID extraction results on all ACL Anthology PDFs, including full text and references: size 3.6G, download here
- 💾 Dataframe with extracted metadata (table below with details) and full text of the collection, for analysis: size 489M, download here
Column name | Description |
---|---|
acl_id | unique ACL id |
abstract | abstract extracted by GROBID |
full_text | full text extracted by GROBID |
corpus_paper_id | Semantic Scholar ID |
pdf_hash | sha1 hash of the pdf |
numcitedby | number of citations from S2 |
url | link of publication |
publisher | - |
address | Address of conference |
year | - |
month | - |
booktitle | - |
author | list of authors |
title | title of paper |
pages | - |
doi | - |
number | - |
volume | - |
journal | - |
editor | - |
isbn | - |
```python
import pandas as pd

# Load the combined metadata + full-text dataframe (parquet download above).
df = pd.read_parquet('acl-publication-info.74k.parquet')
df
```
acl_id abstract full_text corpus_paper_id pdf_hash ... number volume journal editor isbn
0 O02-2002 There is a need to measure word similarity whe... There is a need to measure word similarity whe... 18022704 0b09178ac8d17a92f16140365363d8df88c757d0 ... None None None None None
1 L02-1310 8220988 8d5e31610bc82c2abc86bc20ceba684c97e66024 ... None None None None None
2 R13-1042 Thread disentanglement is the task of separati... Thread disentanglement is the task of separati... 16703040 3eb736b17a5acb583b9a9bd99837427753632cdb ... None None None None None
3 W05-0819 In this paper, we describe a word alignment al... In this paper, we describe a word alignment al... 1215281 b20450f67116e59d1348fc472cfc09f96e348f55 ... None None None None None
4 L02-1309 18078432 011e943b64a78dadc3440674419821ee080f0de3 ... None None None None None
... ... ... ... ... ... ... ... ... ... ... ...
73280 P99-1002 This paper describes recent progress and the a... This paper describes recent progress and the a... 715160 ab17a01f142124744c6ae425f8a23011366ec3ee ... None None None None None
73281 P00-1009 We present an LFG-DOP parser which uses fragme... We present an LFG-DOP parser which uses fragme... 1356246 ad005b3fd0c867667118482227e31d9378229751 ... None None None None None
73282 P99-1056 The processes through which readers evoke ment... The processes through which readers evoke ment... 7277828 924cf7a4836ebfc20ee094c30e61b949be049fb6 ... None None None None None
73283 P99-1051 This paper examines the extent to which verb d... This paper examines the extent to which verb d... 1829043 6b1f6f28ee36de69e8afac39461ee1158cd4d49a ... None None None None None
73284 P00-1013 Spoken dialogue managers have benefited from u... Spoken dialogue managers have benefited from u... 10903652 483c818c09e39d9da47103fbf2da8aaa7acacf01 ... None None None None None
[73285 rows x 21 columns]
The provided ACL IDs are consistent with the S2 API as well:
https://api.semanticscholar.org/graph/v1/paper/ACL:P83-1025
The API can be used to fetch more information for each paper in the corpus.
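For example, a minimal sketch for fetching one record through the Graph API endpoint above (the `requests` call and the field selection are illustrative choices, not part of the corpus tooling):

```python
import requests

# Fetch metadata for one paper by its ACL ID via the Semantic Scholar Graph API.
# The fields requested here are illustrative; see the S2 API docs for the full list.
acl_id = "P83-1025"
resp = requests.get(
    f"https://api.semanticscholar.org/graph/v1/paper/ACL:{acl_id}",
    params={"fields": "title,year,citationCount,externalIds"},
)
resp.raise_for_status()
paper = resp.json()
print(paper["title"], paper["citationCount"])
```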
Text generation on Hugging Face
We fine-tuned the distilgpt2 model from Hugging Face on the full text from this corpus. The model is trained for the text-generation task.
Text Generation Demo : https://huggingface.co/shaurya0512/distilgpt2-finetune-acl22
Example:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("shaurya0512/distilgpt2-finetune-acl22")
model = AutoModelForCausalLM.from_pretrained("shaurya0512/distilgpt2-finetune-acl22")

input_context = "We introduce a new language representation"
input_ids = tokenizer.encode(input_context, return_tensors="pt")  # encode input context
outputs = model.generate(
    input_ids=input_ids,
    max_length=128,
    do_sample=True,  # sampling is required for temperature to take effect
    temperature=0.7,
    repetition_penalty=1.2,
)  # generate sequences
print(f"Generated: {tokenizer.decode(outputs[0], skip_special_tokens=True)}")
```
```text
Generated: We introduce a new language representation for the task of sentiment classification. We propose an approach to learn representations from
unlabeled data, which is based on supervised learning and can be applied in many applications such as machine translation (MT) or information retrieval
systems where labeled text has been used by humans with limited training time but no supervision available at all. Our method achieves state-of-the-art
results using only one dataset per domain compared to other approaches that use multiple datasets simultaneously, including BERTScore (Devlin et al.,
2019; Liu & Lapata, 2020b); RoBERTa+LSTM + L2SRC -
```
TODO

- Link the ACL corpus to Semantic Scholar (S2) and sources like S2ORC
- Extract figures and captions from the ACL corpus using pdffigures - scientific-figure-captioning
- Have a release schedule to keep the corpus updated
- Build an ACL citation graph
- Enhance metadata with bib file mapping - include authors
- Add citation counts for papers
- Use ForeCite to extract impactful keywords from the corpus
- Link datasets using paperswithcode? - not sure how useful this is
- Add some stats about the data: linguistic diversity; geo-diversity; if possible an explorer
We are hoping that this corpus can be helpful for analysis relevant to the ACL community.
Please cite/star 🌟 this page if you use this corpus
Citing the ACL Anthology Corpus
If you use this corpus in your research, please use the following BibTeX entry:

```bibtex
@Misc{acl_anthology_corpus,
  author       = {Shaurya Rohatgi},
  title        = {ACL Anthology Corpus with Full Text},
  howpublished = {Github},
  year         = {2022},
  url          = {https://github.com/shauryr/ACL-anthology-corpus}
}
```
Acknowledgements
We thank Semantic Scholar for providing access to the citation-related data in this corpus.
License
The ACL Anthology Corpus is released under CC BY-NC 4.0. By using this corpus, you agree to its usage terms.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data collection
This dataset contains information on the eprints posted on arXiv from its launch in 1991 until the end of 2019 (1,589,006 unique eprints), plus data on their citations and the associated impact metrics. Here, eprints include preprints, conference proceedings, book chapters, data sets and commentary, i.e., all electronic material that has been posted on arXiv.
The content and metadata of the arXiv eprints were retrieved from the arXiv API (https://arxiv.org/help/api/) as of 21st January 2020, where the metadata included the eprint's title, author, abstract, subject category and arXiv ID (arXiv's original eprint identifier). In addition, the associated citation data were derived from the Semantic Scholar API (https://api.semanticscholar.org/) from 24th January 2020 to 7th February 2020, containing the citation information in and out of the arXiv eprints and their published versions (if applicable). Here, whether an eprint has been published in a journal or by other means is assumed to be inferable, albeit indirectly, from the status of the digital object identifier (DOI) assignment. It is also assumed that if an arXiv eprint received c_pre and c_pub citations until the data retrieval date (7th February 2020) before and after it was assigned a DOI, respectively, then the citation count of this eprint is recorded in the Semantic Scholar dataset as c_pre + c_pub. Both the arXiv API and the Semantic Scholar datasets contained the arXiv ID as metadata, which served as a key variable to merge the two datasets.
The classification of research disciplines is based on that described in the arXiv.org website (https://arxiv.org/help/stats/2020_by_area/). There, the arXiv subject categories are aggregated into several disciplines, of which we restrict our attention to the following six disciplines: Astrophysics (‘astro-ph’), Computer Science (‘comp-sci’), Condensed Matter Physics (‘cond-mat’), High Energy Physics (‘hep’), Mathematics (‘math’) and Other Physics (‘oth-phys’), which collectively accounted for 98% of all the eprints. Those eprints tagged to multiple arXiv disciplines were counted independently for each discipline. Due to this overlapping feature, the current dataset contains a cumulative total of 2,011,216 eprints.
Some general statistics and visualisations per research discipline are provided in the original article (Okamura, to appear), where the validity and limitations associated with the dataset are also discussed.
Description of columns (variables)
arxiv_id : arXiv ID
category : Research discipline
pre_year : Year of posting v1 on arXiv
pub_year : Year of DOI acquisition
c_tot : No. of citations acquired during 1991–2019
c_pre : No. of citations acquired before and including the year of DOI acquisition
c_pub : No. of citations acquired after the year of DOI acquisition
c_yyyy (yyyy = 1991, …, 2019) : No. of citations acquired in the year yyyy (with ‘yyyy’ running from 1991 to 2019)
gamma : The quantitatively-and-temporally normalised citation index
gamma_star : The quantitatively-and-temporally standardised citation index
Note: The definition of the quantitatively-and-temporally normalised citation index (γ; ‘gamma’) and that of the standardised citation index (γ*; ‘gamma_star’) are provided in the original article (Okamura, to appear). Both indices can be used to compare the citational impact of papers/eprints published in different research disciplines at different times.
Data files
A comma-separated values file (‘arXiv_impact.csv’) and a Stata file (‘arXiv_impact.dta’) are provided, both containing the same information.
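A minimal loading sketch in Python (the file name comes from above; the consistency check between `c_tot` and the per-year columns is an illustrative assumption based on the column descriptions, not part of the original documentation):

```python
import pandas as pd

# Load the citation-impact table.
df = pd.read_csv("arXiv_impact.csv")

# Illustrative check (assumption): the per-year counts c_1991..c_2019
# should add up to c_tot, which covers 1991-2019.
year_cols = [f"c_{y}" for y in range(1991, 2020)]
print((df[year_cols].sum(axis=1) == df["c_tot"]).mean())
```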
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Semantic Scholar: Manajemen Proyek (Indonesian for “Project Management”)
Batch-fetched from the Semantic Scholar API with the following parameters:
query = '"manajemen proyek" | "project management" | "metodologi manajemen proyek" | "teknik manajemen proyek" | "alat manajemen proyek" | "keterampilan manajemen proyek" | "tantangan manajemen proyek" | "risiko manajemen proyek" | "studi kasus manajemen proyek" | "tren manajemen proyek" | "manajemen proyek di berbagai industri" | "manajemen proyek dalam organisasi" | "manajemen proyek… See the full description on the dataset page: https://huggingface.co/datasets/derhan/semantic-scholar-manajemen-proyek.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset for our systematic mapping study on API Deprecation:
L. Bonorden and M. Riebisch, "API Deprecation: A Systematic Mapping Study," in 48th Euromicro Conference on Software Engineering and Advanced Applications (SEAA 2022), Maspalomas, Spain, 2022.
We provide the decisive criteria for excluded studies and the complete classification for included studies.
Furthermore, we provide a list of related works—i.e., references and citations identified through snowballing on the included studies. The IDs in the data set relate to entries in the Semantic Scholar Academic Graph, which may be accessed via semanticscholar.org/paper/{ID}.
(The list of related works has been added in version 2 of this data set.)
https://choosealicense.com/licenses/unknown/
This is a dataset for classifying citation intents in academic papers. The main citation intent label for each JSON object is specified with the label key, while the citation context is specified with the context key. Example:

```
{
  'string': 'In chacma baboons, male-infant relationships can be linked to both formation of friendships and paternity success [30,31].',
  'sectionName': 'Introduction',
  'label': 'background',
  'citingPaperId': '7a6b2d4b405439',
  'citedPaperId': '9d1abadc55b5e0',
  ...
}
```

You may obtain the full information about each paper using the provided paper IDs with the Semantic Scholar API (https://api.semanticscholar.org/). The labels are: Method, Background, Result.
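For instance, a minimal sketch for tallying the intent labels, assuming the records are distributed one JSON object per line (the file name is hypothetical):

```python
import json
from collections import Counter

# Count citation-intent labels, assuming one JSON record per line (JSONL).
counts = Counter()
with open("citation_intents.jsonl") as f:
    for line in f:
        counts[json.loads(line)["label"]] += 1
print(counts)
```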
Open Data Commons Attribution License (ODC-By) v1.0: https://www.opendatacommons.org/licenses/by/1.0/
License information was derived automatically
The Scientific Literature Comparison Table (SLCT) dataset was collected using the arXiv and Semantic Scholar APIs. It underwent a series of processing steps, including manual inspection and editing.
The processing steps are summarized as follows:
1) Downloading survey papers' LaTeX files using the arXiv API.
2) Preprocessing LaTeX files to HTML format.
3) Extracting tables from the HTML files.
4) Creating a Golden Table as a reference.
5) Generating descriptions for column headers.
6) Acquiring citation data.
7) Finalizing the dataset.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This data set contains frequency counts of target words in 175 million academic abstracts published in all fields of knowledge. We quantify the prevalence of words denoting prejudice against ethnicity, gender, sexual orientation, gender identity, minority religious sentiment, age, body weight and disability in SSORC abstracts over the period 1970-2020. We then examine the relationship between the prevalence of such terms in the academic literature and their concomitant prevalence in news media content. We also analyze the temporal dynamics of an additional set of terms associated with social justice discourse in both the scholarly literature and in news media content. A few additional words not denoting prejudice are also available since they are used in the manuscript for illustration purposes.
The list of academic abstracts analyzed in this work was taken from the Semantic Scholar Open Research Corpus (SSORC). The corpus contains, as of 2020, over 175 million academic abstracts, and associated metadata, published in all fields of knowledge. The raw data is provided by Semantic Scholar in accessible JSON format.
Textual content included in our analysis is circumscribed to the scholarly articles’ titles and abstracts and does not include other article elements such as main body of text or references section. Thus, we use frequency counts derived from academic articles’ titles and abstracts as a proxy for word prevalence in those articles. This proxy was used because the SSORC corpus does not provide the entire text body of the indexed articles. Targeted textual content was located in JSON data and sorted by year to facilitate chronological analysis. Tokens were lowercased prior to estimating frequency counts.
Yearly relative frequencies of a target word or n-gram in the SSORC corpus were estimated by dividing the number of occurrences of the target word/n-gram in all scholarly articles within a given year by the total number of all words in all articles of that year. This method of estimating word frequencies accounts for variable volume of total scientific output over time. This approach has been shown before to accurately capture the temporal dynamics of historical events and social trends in news media corpora.
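As a sketch of that calculation (the variable names here are ours):

```python
# Yearly relative frequency: occurrences of the target word/n-gram across all
# abstracts in a year, divided by the total word count of that year's abstracts.
def relative_frequency(target_counts_by_year, total_words_by_year, year):
    return target_counts_by_year[year] / total_words_by_year[year]
```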
It is possible that a small percentage of scholarly articles in the SSORC corpus contain incorrect or missing data. For earlier years in the SSORC corpus, abstract information is sometimes missing and only the article's title is available. As a result, the total and target word count metrics for a small subset of academic abstracts might not be precise. In a data analysis of 175 million scientific abstracts, manually checking the accuracy of frequency counts for every single academic abstract is unfeasible, and one hundred percent accuracy at capturing abstracts' content might be elusive due to a small number of erroneous outlier cases in the raw data. Overall, however, we are confident that our frequency metrics are representative of word prevalence in academic content, as illustrated by Figure 2 in the main manuscript, which shows the chronological prevalence in the SSORC corpus of several terms associated with different disciplines of scientific/academic knowledge.
Factor analysis of the frequency count time series was carried out only after Bartlett's test of sphericity and the Kaiser-Meyer-Olkin (KMO) test confirmed the suitability of the data for factor analysis. A single factor derived from the frequency count time series of prejudice-denoting terms was extracted from each corpus (academic abstracts and news media content). The same procedure was applied to the terms denoting social justice discourse. A factor loading cutoff of 0.5 was used to ascribe terms to a factor. Cronbach's alphas, computed to determine whether the resulting factors were coherent, were extremely high (>0.95).
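One way to reproduce these checks in Python is sketched below, assuming the `factor_analyzer` package and a hypothetical CSV of per-term yearly frequencies (neither is specified by the original data set):

```python
import pandas as pd
from factor_analyzer import FactorAnalyzer
from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity, calculate_kmo

# X: rows = years, columns = frequency time series of prejudice-denoting terms
# (hypothetical file name).
X = pd.read_csv("term_frequencies_by_year.csv", index_col=0)

chi2, p = calculate_bartlett_sphericity(X)  # suitability test 1
kmo_per_var, kmo_model = calculate_kmo(X)   # suitability test 2

# Extract a single factor and apply the 0.5 loading cutoff from the text.
fa = FactorAnalyzer(n_factors=1, rotation=None)
fa.fit(X)
loadings = pd.Series(fa.loadings_[:, 0], index=X.columns)
print(loadings[loadings.abs() >= 0.5])
```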
The textual content of news and opinion articles from the outlets listed in Figure 5 of the main manuscript is available in the outlets' online domains and/or public cache repositories such as Google cache (https://webcache.googleusercontent.com), the Internet Wayback Machine (https://archive.org/web/web.php), and Common Crawl (https://commoncrawl.org). We derived word frequency counts from the original sources. Textual content included in our analysis is circumscribed to article headlines and the main body of text and does not include other article elements such as figure captions.
Targeted textual content was located in the raw HTML data using outlet-specific XPath expressions. Tokens were lowercased prior to estimating frequency counts. To prevent outlets with sparse text content in a year from distorting aggregate frequency counts, we only include an outlet's frequency counts for years with at least 1 million words of article content from that outlet.
Yearly frequency usage of a target word in an outlet in any given year was estimated by dividing the total number of occurrences of the target word in all articles of a given year by the number of all words in all articles of that year. This method of estimating frequency accounts for variable volume of total article output over time.
The compressed files in this data set are listed next:
-analysisScripts.rar contains the analysis scripts used in the main manuscript and raw data metrics
-scholarlyArticlesContainingTargetWords.rar contains the IDs of each analyzed abstract in the SSORC corpus and the counts of target words and total words for each scholarly article
-targetWordsInMediaArticlesCounts.rar contains counts of target words in news outlets articles as well as total counts of words in articles
In a small percentage of news articles, outlet-specific XPath expressions can fail to properly capture the content of the article due to the heterogeneity of HTML elements and CSS styling combinations with which article text content is arranged in outlets' online domains. As a result, the total and target word count metrics for a small subset of articles might not be precise.
In a data analysis of millions of news articles, we cannot manually check the correctness of frequency counts for every single article, and one hundred percent accuracy at capturing articles' content is elusive due to a small number of difficult-to-detect boundary cases, such as incorrect HTML markup syntax in online domains. Overall, however, we are confident that our frequency metrics are representative of word prevalence in print news media content (see Rozado, Al-Gharbi, and Halberstadt, "Prevalence of Prejudice-Denoting Words in News Media Discourse" for supporting evidence).
31/08/2022 Update: There is a new way to download the Semantic Scholar Open Research Corpus (see https://github.com/allenai/s2orc). This updated version states that the corpus contains 136M+ paper nodes. However, when I downloaded a previous version of the corpus in 2021 from http://s2-public-api-prod.us-west-2.elasticbeanstalk.com/corpus/download/ I counted 175M unique identifiers. The URL of the previous version of the corpus is no longer active, but it has been cached by the Internet Archive at https://web.archive.org/web/20201030131959/http://s2-public-api-prod.us-west-2.elasticbeanstalk.com/corpus/download/ I haven't had the time to look at the specific reason for the mismatch, but perhaps the newer version of the corpus has cleaned a lot of noisy entries from the previous version, which often contained entries with missing abstracts. Filtering out entries in low-prevalence languages other than English might be another reason. In any case, Figure 2 of the main manuscript of this work (at https://www.nas.org/academic-questions/35/2/themes-in-academic-literature-prejudice-and-social-justice) should provide support for the validity of the frequency counts.
Parallel to the CORD-19 dataset of scholarly articles, we provide the literature graph LG-covid19-HOTP, composed not only of articles (graph nodes) relevant to the study of coronavirus, but also of in- and out-citation links (directed graph edges) to support navigation and search among the articles. The article records are related and connected, not isolated. The graph has been updated weekly since March 26, 2020. The current graph includes 28,669 hot-off-the-press (HOTP) articles since January 2020. It contains 402,946 articles and 3,604,234 links. The link-to-node ratio is remarkably higher than in some other existing literature graphs. In addition to the dataset, we provide more functionality at lg-covid-19-hotp.cs.duke.edu, such as new articles, weekly metadata analysis of publication growth over time, ranking by citation, and statistical near-neighbor embedding maps by similarity in co-citation and similarity in co-reference. Since April 11, we have enabled a novel functionality: self-navigated surf-search over the maps. At the site we also accept courtesy submissions of COVID-19 articles that are missing from the current collection.

References:
- Semantic Scholar Open Research Corpus. 2019. Version 2019-11-01. Retrieved from http://s2-public-api-prod.us-west-2.elasticbeanstalk.com/corpus/download/. Accessed 2019-12-06.
- Elsevier Scopus Citation Overview API. Accessed 2020-03-25.
- COVID-19 Open Research Dataset (CORD-19). 2020. Version 2020-03-20. Retrieved from https://pages.semanticscholar.org/coronavirus-research. Accessed 2020-03-26. doi:10.5281/zenodo.3727291.
- Crossref REST API. Available at www.crossref.org. Accessed 2020-03-25.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This resource contains the following files:
- `venues.txt`: The venues that were used for selecting PDFs published in the last 5 years from the [Semantic Scholar Open Research Corpus](http://s2-public-api-prod.us-west-2.elasticbeanstalk.com/corpus/).
- `extracted-tables.tar.gz`: All tables that we extracted using [Tabula](https://github.com/tabulapdf/tabula) from these PDFs.
- `sample-400.tar.gz`: A sample of these tables which we used for annotation.
- `ontology.ttl`: The annotation ontology in Turtle format.
- `all_metadata.jsonl`: Annotations for this sample in the JSON format described below.
- `labelqueries.csv`: The label queries used for weak annotation, created using the annotation interface. This CSV file contains 6 columns: a numeric ID, the label query template name (`template`), the template slots (`slots`), the label type (`label`), the annotation value (`value`), and a toggle for the interface (`enabled`).
- `labelqueries-sparql-templates.zip`: The label query templates. These are SPARQL queries with slots of the form `{{slot}}`. The templates in `labelqueries.csv` refer to these files.
- `rules.txt`: Datalog rules that we used for entity resolution.
- `tab2know-graph.nt.gz`: The final RDF graph that contains all extracted table structures, predicted table and column classes, and resolved entity links.
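To illustrate how the label-query pieces fit together, here is a minimal slot-filling sketch (the template file layout and the JSON serialization of the `slots` column are assumptions, not documented details):

```python
import csv
import json
import re

def fill_template(template_text, slots):
    # Replace each {{slot}} placeholder with its value from the slots mapping.
    return re.sub(r"\{\{(\w+)\}\}", lambda m: str(slots[m.group(1)]), template_text)

with open("labelqueries.csv") as f:
    row = next(csv.DictReader(f))

slots = json.loads(row["slots"])  # assumed serialization of the slots column
with open(f"{row['template']}.sparql") as f:  # hypothetical template file name
    print(fill_template(f.read(), slots))
```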
http://inspire.ec.europa.eu/metadata-codelist/LimitationsOnPublicAccess/noLimitations
Data derive from F. Palmas' article, published in February 2015. The aim of this paper is to identify stable spawning and nursery grounds in the Sardinian slope region (central-western Mediterranean Sea) for Aristaeomorpha foliacea and Aristeus antennatus. This study also generated relevant information on the spatial and temporal distribution of seasonal or persistent aggregations of spawners and recruits, providing scientific elements to support the protection of these important resources. For more information refer to https://pdfs.semanticscholar.org/3be9/f68e6a203df108f29bdd4c19f5d755682a5e.pdf
https://choosealicense.com/licenses/odc-by/
Dataset Card for Climate Change NER
Climate Change NER is an English-language dataset containing 534 abstracts of climate-related papers, sourced from the Semantic Scholar Academic Graph "abstracts" dataset. The abstracts have been manually annotated by classifying climate-related tokens into a set of 13 categories.
Dataset Details
Dataset Description
We introduce a comprehensive dataset for developing and evaluating NLP models tailored towards… See the full description on the dataset page: https://huggingface.co/datasets/ibm-research/Climate-Change-NER.
Turkish Product Review Dataset
This data originally comes from https://www.win.tue.nl/~mpechen/projects/smm/#Datasets
BibTeX Citation
If you use this dataset, please cite the following paper:

```bibtex
@inproceedings{Demirtas2013CrosslingualPD,
  title     = {Cross-lingual polarity detection with machine translation},
  author    = {Erkin Demirtas and Mykola Pechenizkiy},
  booktitle = {wisdom},
  year      = {2013},
  url       = {https://api.semanticscholar.org/CorpusID:3912960}
}
```
See the full description on the dataset page: https://huggingface.co/datasets/asparius/Turkish-Product-Review.
Turkish Movie Review Dataset
This data originally comes from https://www.win.tue.nl/~mpechen/projects/smm/#Datasets
BibTeX Citation
If you use this dataset, please cite the following paper:

```bibtex
@inproceedings{Demirtas2013CrosslingualPD,
  title     = {Cross-lingual polarity detection with machine translation},
  author    = {Erkin Demirtas and Mykola Pechenizkiy},
  booktitle = {wisdom},
  year      = {2013},
  url       = {https://api.semanticscholar.org/CorpusID:3912960}
}
```
See the full description on the dataset page: https://huggingface.co/datasets/asparius/Turkish-Movie-Review.