OpenAlex is a fully open catalog of the global research system. It's named after the ancient library of Alexandria and made by the non-profit OurResearch.
The OpenAlex dataset describes scholarly entities and how those entities are connected to each other. Types of entities include works, authors, sources, institutions, topics, publishers, and funders. Together, these make a huge web (or more technically, a heterogeneous directed graph) of hundreds of millions of entities and billions of connections between them all.
OpenAlex offers an open replacement for industry-standard scientific knowledge bases like Elsevier's Scopus and Clarivate's Web of Science. Compared to these paywalled services, OpenAlex offers significant advantages in terms of inclusivity, affordability, and availability.
The data here are derived from the snapshot data, which is updated about once per month. The raw data are stored on Amazon S3 in the publicly available openalex bucket as gzip-compressed JSON lines files. We use custom functions in Python code to flatten these records into the relational database hosted here on Redivis.
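As a rough illustration of the flattening step described above, the sketch below reads gzip-compressed JSON lines and splits each nested work record into rows for two tables. It is not the code used to build the Redivis database: the table layouts, the function names `flatten_work` and `read_jsonl_gz`, and the sample records are all hypothetical, and only a handful of work fields are used.

```python
import gzip
import io
import json


def flatten_work(record):
    """Split one nested work record into rows for two hypothetical tables."""
    work_row = {
        "work_id": record["id"],
        "title": record.get("title"),
        "publication_year": record.get("publication_year"),
    }
    # One row per (work, author) pair, keeping the author order.
    authorship_rows = [
        {
            "work_id": record["id"],
            "author_position": position,
            "author_id": authorship["author"]["id"],
        }
        for position, authorship in enumerate(record.get("authorships", []), start=1)
    ]
    return work_row, authorship_rows


def read_jsonl_gz(fileobj):
    """Yield one parsed JSON object per non-empty line of a gzip stream."""
    with gzip.open(fileobj, mode="rt", encoding="utf-8") as lines:
        for line in lines:
            if line.strip():
                yield json.loads(line)


# Build a tiny in-memory sample instead of downloading a snapshot file.
sample_records = [
    {"id": "W1", "title": "First paper", "publication_year": 2020,
     "authorships": [{"author": {"id": "A1"}}, {"author": {"id": "A2"}}]},
    {"id": "W2", "title": "Second paper", "publication_year": 2021,
     "authorships": [{"author": {"id": "A2"}}]},
]
payload = "\n".join(json.dumps(r) for r in sample_records).encode("utf-8")
compressed = io.BytesIO(gzip.compress(payload))

works_table, authorships_table = [], []
for record in read_jsonl_gz(compressed):
    work_row, authorship_rows = flatten_work(record)
    works_table.append(work_row)
    authorships_table.extend(authorship_rows)
```

Splitting repeated nested values (here, authorships) into their own table with a `work_id` key is one common way to make JSON records joinable in a relational database.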
The live data are also available for free via the REST API.
If you use OpenAlex in research, please cite this paper:
Priem, J., Piwowar, H., & Orr, R. (2022).
OpenAlex: A fully-open index of scholarly works, authors, venues, institutions, and concepts. ArXiv. https://arxiv.org/abs/2205.01833
Licence Ouverte / Open Licence 2.0: https://www.etalab.gouv.fr/wp-content/uploads/2018/11/open-licence.pdf
License information was derived automatically
This dataset lists all the issues on the GitHub repository https://github.com/dataesr/openalex-affiliations/. Each issue is a correction of OpenAlex affiliations. The dataset is updated every day at 7 AM GMT through a GitHub Action on this repository: https://github.com/dataesr/openalex-affiliations/blob/main/.github/workflows/sync_openalex_affiliations_github_issues.yml.
Fields:
- "github_issue_id": integer, issue number according to GitHub
- "github_issue_link": string, web link to the GitHub issue
- "state": string, current state of the issue, either "open" or "closed". An issue is closed once it has been ingested by OpenAlex.
- "date_opened": date, when the issue was opened, e.g. "2024-11-15"
- "date_closed": date, when the issue was closed, null if not closed, e.g. "2024-11-18"
- "raw_affiliation_name": string, raw affiliation string as collected by OpenAlex
- "has_added_rors": boolean, whether the correction suggests adding a ROR, 1 if true, 0 if false
- "has_removed_rors": boolean, whether the correction suggests removing a ROR, 1 if true, 0 if false
- "new_rors": string, list of corrected RORs, separated by ";"
- "previous_rors": string, list of RORs before correction, separated by ";"
- "added_rors": string, list of RORs added by the correction, separated by ";"
- "removed_rors": string, list of RORs removed by the correction, separated by ";"
- "openalex_works_examples": string, web link to an OpenAlex work mentioning the affiliation string
- "searched_between": range of years searched, e.g. "2018 - 2024"
- "contact": string, encrypted version of the email address of the correction's author; only the domain name is left unencrypted
- "contact_domain": string, domain name of the contact email address
- "version": version of the works-magnet app used to collect the correction
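The field descriptions above suggest how a single record might be parsed. The sketch below is only an illustration under assumptions: the record is taken to be a dictionary of strings (as a CSV reader would produce), the ROR values in `sample_row` are made-up placeholders, and the helper names `split_rors` and `parse_correction` are hypothetical.

```python
def split_rors(value):
    """Turn a ';'-separated ROR string into a list; empty or missing means no RORs."""
    if not value:
        return []
    return [ror.strip() for ror in value.split(";") if ror.strip()]


def parse_correction(row):
    """Convert the string fields of one correction record into Python types."""
    return {
        "issue": int(row["github_issue_id"]),
        "is_closed": row["state"] == "closed",
        # The dataset encodes booleans as 1/0; accept both int and str forms.
        "has_added_rors": row["has_added_rors"] in (1, "1"),
        "has_removed_rors": row["has_removed_rors"] in (1, "1"),
        "previous_rors": split_rors(row["previous_rors"]),
        "new_rors": split_rors(row["new_rors"]),
        "added_rors": split_rors(row["added_rors"]),
        "removed_rors": split_rors(row["removed_rors"]),
    }


# Placeholder values for illustration only; these are not real ROR identifiers.
sample_row = {
    "github_issue_id": "42",
    "state": "closed",
    "has_added_rors": "1",
    "has_removed_rors": "1",
    "previous_rors": "ror-a;ror-b",
    "new_rors": "ror-a;ror-c",
    "added_rors": "ror-c",
    "removed_rors": "ror-b",
}
correction = parse_correction(sample_row)
```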
https://academictorrents.com/nolicensespecified
OpenAlex is a new, fully-open scientific knowledge graph (SKG), launched to replace the discontinued Microsoft Academic Graph (MAG). It contains metadata for 209M works (journal articles, books, etc); 213M disambiguated authors; 124k venues (places that host works, such as journals and online repositories); 109k institutions; and 65k Wikidata concepts (linked to works via an automated hierarchical multi-tag classifier). The dataset is fully and freely available via a web-based GUI, a full data dump, and high-volume REST API. The resource is under active development and future work will improve accuracy and coverage of citation information and author/institution parsing and deduplication. From: Priem, J., Piwowar, H., & Orr, R. (2022). OpenAlex: A fully-open index of scholarly works, authors, venues, institutions, and concepts. ArXiv. Upload details: Downloaded a copy from the AWS endpoint s3://openalex on 2023-08-21. Updates are rolling, so future
albertmartinez/openalex-topic-title-abstract dataset hosted on Hugging Face and contributed by the HF Datasets community
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset lists all the issues on the GitHub repository https://github.com/dataesr/openalex-affiliations/. Each issue is a correction of OpenAlex affiliations. The dataset is updated every day at 2 AM through a GitHub Action on this repository.
Fields:
- "github_issue_id": integer, issue number according to GitHub
- "github_issue_link": string, web link to the GitHub issue
- "state": string, current state of the issue, either "open" or "closed". An issue is closed once it has been handled by OpenAlex.
- "date_opened": date, when the issue was opened, e.g. "2024-11-15"
- "date_closed": date, when the issue was closed, null if not closed, e.g. "2024-11-18"
- "raw_affiliation_name": string, raw affiliation string as collected by OpenAlex
- "has_added_rors": boolean, whether the correction suggests adding a ROR, 1 if true, 0 if false
- "has_removed_rors": boolean, whether the correction suggests removing a ROR, 1 if true, 0 if false
- "new_rors": string, list of corrected RORs, separated by ";"
- "previous_rors": string, list of RORs before correction, separated by ";"
- "added_rors": string, list of RORs added by the correction, separated by ";"
- "removed_rors": string, list of RORs removed by the correction, separated by ";"
- "openalex_works_examples": string, web link to an OpenAlex work mentioning the affiliation string
- "contact": string, encrypted version of the email address of the correction's author; only the domain name is left unencrypted
- "contact_domain": string, domain name of the contact email address
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Two datasets, works_published and works_cited, for the year 2020 from the OpenAlex database.
License: see https://github.com/ourresearch/openalex-docs/blob/main/license.md: "OpenAlex data is made available under the CC0 license. That means it's in the public domain, and free to use in any way you like. We appreciate attribution where it's convenient, but it's not at all necessary. There is one exception: the MAG Format snapshot is released under ODC-BY, as per the original MAG license applied by Microsoft (it reuses their schema). See the LICENSE.txt file in the MAG format snapshot distribution for attribution requirement details."
Data quality considerations: OpenAlex has improved the accuracy of its data with help from algorithms and institutions. Our current data quality assessment showed precision and recall above 95%.
The first dataset, "works_published", contains publications authored by individuals affiliated with the University of Arizona (UArizona). The data is retrieved with the openalexR package by querying the OpenAlex database with UArizona's Research Organization Registry (ROR) ID (03m2x1q45) and specific publication date ranges. Key aspects of this dataset:
- Scope: It contains records of scholarly works associated with UArizona authors, including various publication types such as journals, repositories (like PubMed and arXiv), and others. The results can also be restricted to "journal" type publications by passing the primary_location.source.type = "journal" parameter to the oa_fetch function.
- Temporal coverage: The source code demonstrates fetching data for specific years (e.g., 2019, 2020, 2021, 2022, 2023).
- Data retrieval: The process uses the oa_fetch function from the openalexR package with the entity="works" parameter and the institutions.ror filter.
- Data structure: Each record in this dataset represents a publication and includes various fields. Certain fields are data frames.
- Usage: This dataset is used as a starting point for various data analyses and data mining.
The second dataset, "works_cited", contains the scholarly works cited by the publications in the works_published dataset. It is created by extracting the OpenAlex IDs from the $referenced_works field of the works_published data and then using the oa_fetch function to retrieve the full metadata for these cited works. Key aspects of this dataset:
- Scope: It includes metadata for a wide range of scholarly works that have been cited by UArizona-affiliated publications. This can encompass articles, books, preprints, book chapters, and other types of scholarly outputs.
- Data derivation: The dataset is derived from the referenced_works field of the works_published dataset.
- Data structure: Each record in this dataset represents a cited work and contains various fields retrieved by the OpenAlex API.
The third file (institution_publications.r) is the source code used to build the above datasets. Note that the code retrieves additional years besides 2020.
Usage: Both datasets support publication and citation analysis and mining, including:
- Identifying the most frequently cited works and journals.
- Analyzing the journal usage and publisher distribution of cited works.
- Understanding the scholarly landscape influencing UArizona research.
- Identifying potential resources for library collections based on citation frequency.
- Investigating the presence and frequency of citations from specific publishers or to specific works.
For inquiries regarding the contents of this dataset, please contact the Corresponding Author listed in the README.txt file. Administrative inquiries (e.g., removal requests, trouble downloading, etc.) can be directed to data-management@arizona.edu.
This item is part of University of Arizona authors' scholarly works published and cited works.
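The datasets themselves were built in R with openalexR. Purely as an illustration in Python, the sketch below shows the two ideas described above: composing an OpenAlex API query for one institution and year, and counting how often each work appears in the referenced_works field. The function names `works_url` and `count_cited_works` and the sample records are hypothetical, the filter names are taken from the description above and from the public OpenAlex API as I understand it, and no network request is made.

```python
from collections import Counter
from urllib.parse import urlencode


def works_url(ror_id, year, journals_only=False):
    """Compose a works query URL for one institution (by ROR ID) and one year."""
    filters = [
        f"institutions.ror:{ror_id}",
        f"from_publication_date:{year}-01-01",
        f"to_publication_date:{year}-12-31",
    ]
    if journals_only:
        filters.append("primary_location.source.type:journal")
    return "https://api.openalex.org/works?" + urlencode({"filter": ",".join(filters)})


def count_cited_works(works_published):
    """Count, for each cited work ID, how many publishing works reference it."""
    counts = Counter()
    for work in works_published:
        # Use a set so a work listing the same reference twice counts once.
        counts.update(set(work.get("referenced_works") or []))
    return counts


# Hypothetical records standing in for rows of works_published.
sample_published = [
    {"id": "W1", "referenced_works": ["W10", "W11"]},
    {"id": "W2", "referenced_works": ["W10"]},
    {"id": "W3", "referenced_works": None},
]
url = works_url("03m2x1q45", 2020, journals_only=True)
cited_counts = count_cited_works(sample_published)
most_cited = cited_counts.most_common(1)
```

The IDs with the highest counts are the ones whose full metadata would then be fetched to build a works_cited-style table.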
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This data set contains data on core sources and core publications identified in the OpenAlex database (based on the OpenAlex snapshot released on August 30, 2024).
The source code used to identify core sources and core publications in OpenAlex is available in this GitHub repository.
See this report for more information about the identification of core sources and core publications in OpenAlex.
This data set consists of the following tab-delimited files.
- source.tsv, with columns: source_id, source, source_type, issn_l, is_core_source, n_works, n_core_works
- work.tsv, with columns: work_id, work_type, pub_year, source_id, doi, is_core_work
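As a minimal sketch of how the two files relate, the code below counts works and core works per source from a work.tsv-like table, which could then be compared with the n_works and n_core_works columns of source.tsv. The sample rows, the `core_works_per_source` helper, and the assumption that is_core_work is encoded as "1"/"0" are all hypothetical and should be checked against the actual files.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows using the work.tsv column names.
# The "1"/"0" encoding of is_core_work is an assumption.
WORK_TSV = (
    "work_id\twork_type\tpub_year\tsource_id\tdoi\tis_core_work\n"
    "W1\tarticle\t2020\tS1\t10.1000/a\t1\n"
    "W2\tarticle\t2021\tS1\t10.1000/b\t0\n"
    "W3\tarticle\t2021\tS2\t10.1000/c\t1\n"
)


def core_works_per_source(tsv_file):
    """Return {source_id: (n_works, n_core_works)} for a work.tsv-like file."""
    totals, core = Counter(), Counter()
    for row in csv.DictReader(tsv_file, delimiter="\t"):
        totals[row["source_id"]] += 1
        if row["is_core_work"] == "1":
            core[row["source_id"]] += 1
    return {source: (totals[source], core[source]) for source in totals}


counts = core_works_per_source(io.StringIO(WORK_TSV))
```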
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset contains two samples of "White List" journals, for December 2024 and February 2025. The data was collected through API queries to OpenAlex and by parsing the JSON file from the RCNI website where the "White List" is published. The following properties of "White List" journals were considered: ISSN, title, journal level in the "White List", date accepted, journal ID in OpenAlex, database indexing information (Web of Science, Scopus, DOAJ), publisher and country of publication, total citations, open access data, leading thematic topic, leading thematic field, and the countries of the authors who published in the journal together with the number of publications (if Russian authors were among them, an additional request was made for the number of publications in the journal with Russian authors by year).
These are SQLite database files that can be opened in DB Browser for SQLite.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Aggregated data underlying the blog post "More open abstracts? Comparing abstract coverage in Crossref and OpenAlex": https://bmkramer.github.io/SesameOpenScience_site/thought/202411_open_abstracts/
The dataset contains the following files:
abstracts_crossref_openalex_202410.csv
abstracts_crossref_openalex_data_dictionary.txt
The CSV file contains data on abstract coverage for Crossref DOIs in Crossref and OpenAlex, aggregated by publisher, for the top 1000 publishers by number of retrieved DOIs. Scope is limited to publications with Crossref type 'journal-articles' and publication years 2022-2024. Variables are described in the data dictionary included in this record.
This analysis was performed using Curtin Open Knowledge Initiative (COKI) infrastructure, which is documented on GitHub: https://github.com/The-Academic-Observatory. Here, a number of open data sources (including Crossref, OpenAlex and OpenAIRE) are ingested into a Google BigQuery environment, which can then be queried via SQL.
The following data sources were used:
Crossref (Metadata Plus snapshot 2024-10-31, Crossref member route API 2024-11-20)
OpenAlex (data snapshot 2024-10-30)
The code used to generate the dataset is available on GitHub: https://github.com/bmkramer/more_open_abstracts
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Graphs are uploaded in GML format, which can be easily imported by NetworkX, Gephi, and Neo4j.
The undirected graphs are constructed from the OpenAlex database (last update Dec 2022). Nodes are authors, and an edge between two nodes indicates that the two authors coauthored at least one paper. We excluded papers with more than 10 authors, since they are very scarce and could affect the subsequent analysis of the networks.
The node attributes consist of the list of words used by each author with their frequencies, and all the concepts (of every level) with which the author's papers are labeled.
The only edge attribute is the number of papers coauthored by the two authors.
For more information about OpenAlex concepts, visit https://docs.openalex.org/api-entities/concepts
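The graph construction described above can be sketched with the standard library alone. This is an illustration under assumptions, not the authors' code: each paper is represented simply as a list of author IDs, and the `coauthor_edges` helper and sample data are hypothetical.

```python
from collections import Counter
from itertools import combinations

MAX_AUTHORS = 10  # papers with more authors are skipped, as described above


def coauthor_edges(papers, max_authors=MAX_AUTHORS):
    """Return a Counter mapping undirected author pairs to number of shared papers."""
    edge_weights = Counter()
    for authors in papers:
        unique_authors = sorted(set(authors))
        if len(unique_authors) > max_authors:
            continue
        # Sorting first gives each undirected edge a single canonical key.
        for pair in combinations(unique_authors, 2):
            edge_weights[pair] += 1
    return edge_weights


# Hypothetical author lists, one per paper.
sample_papers = [
    ["A1", "A2"],
    ["A2", "A1", "A3"],
    [f"B{i}" for i in range(11)],  # 11 authors: excluded
]
edges = coauthor_edges(sample_papers)
```

The resulting weights correspond to the edge attribute (number of papers shared by two authors); a graph library could then be used to write them out as GML.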
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This data set contains an algorithmic classification of research publications based on data from OpenAlex. The classification is based on the OpenAlex snapshot released on November 21, 2023.
To build the classification, we used the so-called extended direct citation approach in combination with the Leiden algorithm. The source code of our software is available here. The classification covers the 71 million journal articles, proceedings papers, preprints, and book chapters in OpenAlex that were published between 2000 and 2023 and that are connected to each other by citation links. Based on 1715 million citation links, we built a three-level hierarchical classification. Each publication was assigned to a cluster at each of the three levels of the classification. Clusters consist of publications that are relatively strongly connected by citation links and that can therefore be expected to be topically related. At each level of the classification, a publication was assigned to only one cluster, which means clusters do not overlap.
The classification consists of 4521 micro clusters at the lowest (most granular) level, 917 meso clusters at the middle level, and 20 macro clusters at the highest (least granular) level. We also algorithmically linked each cluster in the classification to one or more of the following five broad main fields: biomedical and health sciences, life and earth sciences, mathematics and computer science, physical sciences and engineering, and social sciences and humanities.
We used the Updated GPT 3.5 Turbo large language model, developed by OpenAI, to label the 4521 micro clusters at the lowest level in the classification. The source code of our software can be found here.
See this blog post for more information about the classification.
The classification, including the labels of the micro clusters, is available in the following tab-delimited files.
clustering.tsv
main_field.tsv
macro_cluster.tsv
macro_cluster_main_field.tsv
meso_cluster.tsv
meso_cluster_main_field.tsv
meso_cluster_source.tsv
micro_cluster.tsv
micro_cluster_main_field.tsv
micro_cluster_keyword.tsv
micro_cluster_source.tsv
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset, compiled by the German Kompetenznetzwerk Bibliometrie, provides access to curated bibliometric data in OpenAlex focussing on the German research landscape.
Curated data is provided for the following entities:
- Address information
- Publishers
- Funding information
- Document types
- Transformative agreements
- Authors (tba)
For an overview of the tables included, see data-overview.md
This release is based on the August 2024 snapshot of OpenAlex. The OPENBIB snapshot is offered in both CSV and JSONL format.
This is an initial release to demonstrate the current state of metadata curation. The aim is to continue these efforts and improve the curation together with the community and data providers.
Data is made available under the CC0 license.
Github repository: https://github.com/kbopenbib/kbopenbib_data/
sumukshashidhar-archive/openalex-old dataset hosted on Hugging Face and contributed by the HF Datasets community
https://creativecommons.org/public-domain/
This dataset belongs to a paper about independent researchers submitted to the STI conference 2024 (https://sti2024.org/). It consists of several files described below. The data is from OpenAlex, collected through the InSySPo instance of the February snapshot of OpenAlex, hosted on Google Cloud. Since Topics are a new feature of OpenAlex and therefore not part of the snapshot, the Topics data, as well as some other data not available in the InSySPo instance at the time of collection, were collected through the OpenAlex API and incorporated in the files. Data from Scopus and Web of Science may be retrieved by using the search string in the appendix of the article.
Files all domains
240307_open_alex_works.tsv
contains all works retrieved with the search string for Independent researchers in OpenAlex in the article's appendix.
Files Social Sciences and/or Arts & Humanities
240312_open_alex_works_soc_sci_arts_2010.tsv
contains articles by Independent researchers in Social Sciences and Humanities published from 2010 and retrieved from OpenAlex.
240312_open_alex_authors_soc_sci_arts_2010.tsv
contains authors who are Independent researchers in Social Sciences and Humanities published from 2010 and retrieved from OpenAlex.
240313_open_alex_authors_all_works_soc_sci_arts_2010.tsv
contains all works by Independent researchers in Social Sciences and Humanities published from 2010 and retrieved from OpenAlex. "All works" means that the researcher has indicated independent status in the affiliation at least once, and the author's other works are also included.
author_distribution_domain1.csv
contains number of works per number of authors in the domain Social Sciences (includes Arts & Humanities).
author_distribution_field33.csv
contains number of works per number of authors in the field Social Sciences.
author_distribution_field12.csv
contains number of works per number of authors in the field Arts & Humanities.
all_ssh_oa.csv
contains data for analyzing open access patterns for the domain Social Sciences (includes Arts & Humanities).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Author name disambiguation V3 initial clusters for the OpenAlex dataset. See https://openalex.org
There are 633803287 rows, split into 4 CSV (comma-delimited) files (with headers).
The CSV files have two columns: "work_author_id" and "author_id"
"work_author_id": An OpenAlex Work ID and an author sequence number, joined with an underscore ("_")
"author_id": An OpenAlex Author ID, representing a unique author in OpenAlex
Works-magnet aims to make the AI-processed metadata for scholarly outputs visible and to help curators improve that metadata. This dataset lists all the corrections requested by works-magnet users to improve OpenAlex affiliation metadata.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset presents a curated collection of over 700 open-access research articles in which Gephi was used as a primary tool for network analysis. The records were extracted using OpenAlex, cleaned and organized to facilitate exploration by students, researchers, and educators. The goal is to provide a reliable and accessible bibliography for those seeking to understand how Gephi has been applied in diverse research contexts. An interactive dashboard was built using Looker Studio to allow users to filter and visualize the dataset by topic, year, journal, and other dimensions. This resource supports academic work by helping users find methodological references and examples of Gephi applications in scholarly research.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A corpus of AI research from OpenAlex. Includes:
A works table with metadata about AI papers
An authors table with information about the authors
An institutions table with information about institutions
A concepts table with information about concepts in works
A MeSH table with information about MeSH terms in works
A concepts json with the OpenAlex concept taxonomy
An abstracts json with deinverted abstracts
A citations json with citations from papers
See ai_openalex_description.md for data dictionaries.
See ai_openalex_methodology.md for a description of the method used to create the dataset.
See here for additional information: https://github.com/nestauk/ai_genomics
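The "deinverted abstracts" mentioned in the file list refer to turning OpenAlex's inverted abstract index (each word mapped to the positions where it occurs) back into running text. How this dataset did it is documented in the files above; the sketch below is only one straightforward way to do it, with a hypothetical helper name and a made-up sample index.

```python
def deinvert_abstract(inverted_index):
    """Rebuild text from a {word: [positions]} inverted index."""
    positions = {}
    for word, indexes in inverted_index.items():
        for index in indexes:
            positions[index] = word
    # Emit words in position order, separated by single spaces.
    return " ".join(positions[i] for i in sorted(positions))


# Made-up example index.
sample_index = {"open": [0, 3], "data": [1], "needs": [2], "tools": [4]}
abstract = deinvert_abstract(sample_index)
```

Original whitespace and line breaks are not recoverable from an index of this kind, so the output is a normalized, single-spaced string.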
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Five separate files used in creating the OpenAlex (https://openalex.org) V3 author name disambiguation model:
ORCID_hard_negative_pairs: Pairs of ORCIDs where either the full name, family name, or given name are a match and would therefore be more difficult to disambiguate.
Disambiguator_all_possible_training_data: Dataset created which contains all possible features for modeling and all possible samples of data. Eventually, this was split into train/val/test and also processed more to create a better balance of positive to negative samples for our purposes.
Disambiguator_final_train_data: Final data which the disambiguator was trained on.
Disambiguator_final_val_data: Data which was used to test the model during training to optimize the features/hyperparameters chosen.
Disambiguator_final_test_data: Final dataset which gave model performance indication after all hyperparameters were tuned and features were chosen.
More details can be found at https://github.com/ourresearch/openalex-name-disambiguation
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This package contains linked datasets from the OpenAIRE Research Graph (OARG) and OpenAlex.
File descriptions:
author_to_publication_dic.json contains a mapping of authors to their publications
downloads_views_dic.json contains mappings of the publication id to the number of its downloads and views
id_doi_dic.json contains a mapping of the publication id to its doi
merged1..5.json contain all publication data from the OARG dataset
necessary_fields_dic.json contains extracted publications’ fields necessary for the work
oarg_ref_rel_dic.json contains mapping of publication id to referenced and related work present in OpenAlex dataset
openalex_found_publications5_4.json contains all data on found publications from the OpenAlex
publication_to_author_dic.json contains a mapping of publications to their authors
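The author-to-publication and publication-to-author files listed above are described as mappings in opposite directions. As a hedged sketch, assuming each JSON file holds an object whose values are lists of IDs (the actual structure is not specified here), one mapping could be derived from the other as follows. The `invert_mapping` helper and the sample IDs are hypothetical.

```python
import json
from collections import defaultdict


def invert_mapping(one_to_many):
    """Invert {key: [values]} into {value: [keys]}, preserving key order."""
    inverted = defaultdict(list)
    for key, values in one_to_many.items():
        for value in values:
            inverted[value].append(key)
    return dict(inverted)


# Hypothetical content standing in for author_to_publication_dic.json.
author_to_publication = json.loads(
    '{"author-1": ["pub-1", "pub-2"], "author-2": ["pub-2"]}'
)
publication_to_author = invert_mapping(author_to_publication)
```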