The Caselaw Access Project makes 40 million pages of U.S. caselaw freely available online from the collections of Harvard Law School Library.
The CAP citation graph shows the connections between cases in the Caselaw Access Project dataset. You can use the citation graph to answer questions like "what is the most influential case?" and "what jurisdictions cite most often to this jurisdiction?".
Learn More: https://case.law/download/citation_graph/
Access Limits: https://case.law/api/#limits
This dataset includes citations and metadata for the CAP citation graph in CSV format.
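Since the citation graph ships as CSV, a question like "what is the most influential case?" can be approximated by counting incoming citations. A minimal sketch follows; the column names (citing_id, cited_id) and sample rows are assumptions, so check the actual header of the downloaded file before relying on them.

```python
import csv
import io
from collections import Counter

# Toy sample in an assumed two-column layout; the real CAP CSV's
# column names may differ -- inspect the file header first.
sample = """citing_id,cited_id
1,2
3,2
3,4
5,2
"""

def most_cited(csv_text):
    """Return (case_id, citation_count) for the most-cited case."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        counts[row["cited_id"]] += 1
    return counts.most_common(1)[0]

print(most_cited(sample))  # case "2" is cited three times
```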
The Caselaw Access Project is by the Library Innovation Lab at Harvard Law School Library.
People are using CAP data to create research, applications, and more. We're sharing examples in our gallery.
Cite Grid is the first visualization we've created based on data from our citation graph.
Have something to share? We're excited to hear about it.
The arXiv HEP-TH (high energy physics theory) citation graph comes from the e-print arXiv and covers all citations within a dataset of 27,770 papers, with 352,807 edges. If paper i cites paper j, the graph contains a directed edge from i to j. If a paper cites, or is cited by, a paper outside the dataset, the graph contains no information about that. The data covers papers in the period from January 1993 to April 2003 (124 months).
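Given the edge direction above, a paper's in-degree counts how often it is cited. The sketch below assumes a SNAP-style tab-separated "FromNodeId ToNodeId" layout with '#' comment lines; the node ids are made up.

```python
from collections import Counter

# Toy excerpt in an assumed SNAP-style layout: one "from to" pair
# per line, comment lines starting with '#'. Node ids are made up.
edges_txt = """# FromNodeId ToNodeId
9301001 9201010
9301002 9201010
9301003 9302001
"""

in_degree = Counter()
for line in edges_txt.splitlines():
    if not line or line.startswith("#"):
        continue
    src, dst = line.split()  # directed edge src -> dst (src cites dst)
    in_degree[dst] += 1

print(in_degree.most_common(1))  # [('9201010', 2)]
```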
Open Data Commons Attribution License (ODC-By) v1.0: https://www.opendatacommons.org/licenses/by/1.0/
License information was derived automatically
We provide an academic graph based on a snapshot of the Microsoft Academic Graph from May 26, 2021. The Microsoft Academic Graph (MAG) is a large-scale dataset containing information about scientific publication records and their citation relations, as well as authors, affiliations, journals, conferences, and fields of study. We acknowledge the Microsoft Academic Graph using the URI https://aka.ms/msracad. For more information regarding the schema and the entities present in the original dataset, please refer to the MAG schema.
MAG for Heterogeneous Graph Learning We use a recent version of MAG from May 2021 and extract all relevant entities to build a graph that can be directly used for heterogeneous graph learning (node classification, link prediction, etc.). The graph contains all English papers, published after 1900, that have been cited at least 5 times per year since the time of publishing. For fairness, we set a constant citation bound of 100 for papers published before 2000. We further include two smaller subgraphs, one containing computer science papers and one containing medicine papers.
Nodes and features We define the following nodes:
paper with mag_id, graph_id, normalized title, year of publication, citations, and a 128-dimensional title embedding built using word2vec. No. of papers: 5,091,690 (all), 1,014,769 (medicine), 367,576 (computer science);
author with mag_id, graph_id, normalized name, citations. No. of authors: 6,363,201 (all), 1,797,980 (medicine), 557,078 (computer science);
field with mag_id, graph_id, level, citations, where level denotes the hierarchical level of the field and 0 is the highest level (e.g. computer science). No. of fields: 199,457 (all), 83,970 (medicine), 45,454 (computer science);
affiliation with mag_id, graph_id, citations. No. of affiliations: 19,421 (all), 12,103 (medicine), 10,139 (computer science);
venue with mag_id, graph_id, citations, and type denoting whether the venue is a conference or a journal. No. of venues: 24,608 (all), 8,514 (medicine), 9,893 (computer science).
Edges We define the following edges:
author is_affiliated_with affiliation. No. of author-affiliation edges: 8,292,253 (all), 2,265,728 (medicine), 665,931 (computer science);
author is_first/last/other paper. No. of author-paper edges: 24,907,473 (all), 5,081,752 (medicine), 1,269,485 (computer science);
paper has_citation_to paper. No. of citation edges: 142,684,074 (all), 16,808,837 (medicine), 4,152,804 (computer science);
paper conference/journal_published_at venue. No. of paper-venue edges: 5,091,690 (all), 1,014,769 (medicine), 367,576 (computer science);
paper has_field_L0/L1/L2/L3/L4 field. No. of paper-field edges: 47,531,366 (all), 9,403,708 (medicine), 3,341,395 (computer science);
field is_in field. No. of field-field edges: 339,036 (all), 138,304 (medicine), 83,245 (computer science).
We further include a reverse edge for each edge type defined above that is denoted with the prefix rev_ and can be removed based on the downstream task.
Data structure The nodes and their respective features are provided as separate .tsv files where each feature represents a column. The edges are provided as a pickled Python dictionary with the schema:
{target_type: {source_type: {edge_type: {target_id: {source_id: {time } } } } } }
We provide three compressed ZIP archives, one for each subgraph (all, medicine, computer science); the file for the complete graph is split into 500 MB chunks. Each archive contains the separate node features and the edge dictionary.
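Once unpickled, the nested edge dictionary can be flattened into plain edge tuples. The stand-in below mirrors the schema above with made-up ids, and assumes the innermost value is the timestamp; verify the exact nesting against the real file.

```python
# Hand-built stand-in for the pickled edge dictionary
# ({target_type: {source_type: {edge_type: {target_id: {source_id: time}}}}});
# ids and timestamps are illustrative only.
edge_dict = {
    "paper": {
        "paper": {
            "has_citation_to": {
                42: {7: 2015, 9: 2018},
                43: {7: 2016},
            }
        }
    }
}

def iter_edges(edges):
    """Yield (edge_type, source_id, target_id, time) tuples."""
    for target_type, by_source in edges.items():
        for source_type, by_edge_type in by_source.items():
            for edge_type, by_target in by_edge_type.items():
                for target_id, by_source_id in by_target.items():
                    for source_id, time in by_source_id.items():
                        yield edge_type, source_id, target_id, time

rows = list(iter_edges(edge_dict))
print(len(rows))  # 3 citation edges
```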
DBLP is a citation network dataset. The citation data is extracted from DBLP, ACM, MAG (Microsoft Academic Graph), and other sources. The first version contains 629,814 papers and 632,752 citations. Each paper is associated with an abstract, authors, year, venue, and title. The dataset can be used for clustering with network and side information, studying influence in the citation network, finding the most influential papers, topic modeling analysis, etc.
PubMedCite is a domain-specific dataset with about 192K biomedical scientific papers and a large citation graph preserving 917K citation relationships between them. It is characterized by preserving the salient contents extracted from the full texts of references, together with the weighted correlations between those salient contents.
The CiteSeer dataset consists of 3312 scientific publications classified into one of six classes. The citation network consists of 4732 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 3703 unique words.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Research in digital fabrication, specifically in 3D concrete printing (3DCP), has seen a substantial increase in publication output in the past five years, making it hard to keep up with the latest developments. The 3dcp.fyi database aims to provide the research community with a comprehensive, up-to-date, and manually curated literature dataset documenting the development of the field from its early beginnings in the late 1990s through its resurgence in the 2010s until today. The dataset is compiled using a systematic approach. A thorough literature search was conducted in scientific databases, following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) scheme. This was then enhanced iteratively with non-indexed literature through a snowball citation search. The authors of the articles were assigned unique and persistent identifiers (ORCID® IDs) through a systematic process that combined querying APIs and manually curating data. The works in the dataset also include references to other works, as long as those referenced works are also included within the same dataset. A citation network graph is created where scientific articles are represented as vertices and their citations to other scientific articles are the edges. The constructed network graph is subjected to detailed analysis using graph-theoretic algorithms such as PageRank. These algorithms evaluate the structure and connections within the graph, yielding quantitative metrics. Currently, the high-quality dataset contains more than 2,600 manually curated scientific works, including journal articles, conference articles, books, and theses, with more than 40,000 cross-references and 2,000 authors, opening up the possibility for more detailed analysis. The data is published on https://3dcp.fyi, ready for import into several reference managers, and is continuously updated.
We encourage researchers to enrich the database by submitting their publications, adding missing works, or suggesting new features.
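To make the PageRank step concrete, here is a bare-bones power-iteration sketch on a toy citation graph. This only illustrates the algorithm; it is not the tooling actually used by 3dcp.fyi, and the toy edges are invented.

```python
# Minimal PageRank by power iteration; dangling nodes (no out-links)
# spread their rank uniformly so the total rank mass stays 1.
def pagerank(edges, d=0.85, iters=50):
    nodes = {n for e in edges for n in e}
    out = {n: [v for u, v in edges if u == n] for n in nodes}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {n: (1 - d) / len(nodes) for n in nodes}
        for u in nodes:
            if out[u]:
                for v in out[u]:
                    new[v] += d * rank[u] / len(out[u])
            else:
                for v in nodes:
                    new[v] += d * rank[u] / len(nodes)
        rank = new
    return rank

# Toy graph: papers a and b cite c; c cites d.
ranks = pagerank([("a", "c"), ("b", "c"), ("c", "d")])
print(sorted(ranks, key=ranks.get, reverse=True))
```

Note that d ends up ranked above c here: d inherits all of c's accumulated rank, which is the kind of structural effect PageRank surfaces beyond raw citation counts.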
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains bitcoin transfer transactions extracted from the Bitcoin Mainnet blockchain. Details of the dataset are given below:
FILENAME FORMAT: The filenames have the following format: btc-tx- For example, file btc-tx-100000-149999-aa.bz2 (together with the remaining parts, if any) contains transactions from block 100000 to block 149999 inclusive. The files are compressed with bzip2 and can be uncompressed using the command bunzip2.
TRANSACTION FORMAT: Each line in a file corresponds to a transaction. The transaction has the following format:
BLOCK TIME FORMAT: The block time file has the following format:
IMPORTANT NOTE: Public Bitcoin Mainnet blockchain data is open and can be obtained by connecting as a node on the blockchain or by using block explorer web sites such as https://btcscan.org . Downloaders and users of this dataset accept full responsibility for using the data in a GDPR-compliant manner or in compliance with any other regulations. We provide the data as is and cannot be held responsible for anything.
NOTE: If you use this dataset, please do not forget to add the DOI number to the citation. If you use our dataset in your research, please also cite our paper: https://link.springer.com/chapter/10.1007/978-3-030-94590-9_14
@incollection{kilicc2022analyzing,
  title = {Analyzing Large-Scale Blockchain Transaction Graphs for Fraudulent Activities},
  author = {K{\i}l{\i}{\c{c}}, Baran and {\"O}zturan, Can and {\c{S}}en, Alper},
  booktitle = {Big Data and Artificial Intelligence in Digital Finance},
  pages = {253--267},
  year = {2022},
  publisher = {Springer, Cham}
}
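The bzip2 archives can also be read directly from Python with the standard-library bz2 module instead of shelling out to bunzip2. The sketch below round-trips a tiny in-memory blob; the transaction lines are placeholders, not the real per-line format.

```python
import bz2

# Round-trip sketch: a real file would be opened with
# bz2.open("btc-tx-....bz2", "rt") and iterated line by line.
payload = b"tx1 ...\ntx2 ...\n"  # placeholder transaction lines
compressed = bz2.compress(payload)
lines = bz2.decompress(compressed).decode().splitlines()
print(len(lines))  # 2
```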
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Unique identifier: https://doi.org/10.5281/zenodo.4718440
Dataset updated: Dec 19, 2022
Dataset provided by: Zenodo
Authors: Can Özturan; Alper Şen; Baran Kılıç
License: Attribution 4.0 (CC BY 4.0)
License information was derived automatically
Description: This dataset contains ether as well as popular ERC20 token transfer transactions extracted from the Ethereum Mainnet blockchain. Only send-ether, contract function call, and contract deployment transactions are present in the dataset. Miner rewards (static block rewards) and uncle block inclusion rewards are added as transactions to the dataset. Transaction fee rewards and "uncles rewards" are not currently included. Details of the dataset are given below:
FILENAME FORMAT: The filenames have the following format: eth-tx- For example, file eth-tx-1000000-1099999.txt.bz2 contains transactions from block 1000000 to block 1099999 inclusive. The files are compressed with bzip2 and can be uncompressed using the command bunzip2.
TRANSACTION FORMAT: Each line in a file corresponds to a transaction. The transaction has the following format: units. ERC20 token transfers (transfer and transferFrom function calls in an ERC20 contract) are indicated by token symbol; for example, GUSD is the Gemini USD stable coin. The JSON file erc20tokens.json given below contains the details of the ERC20 tokens. Failed transactions are prefixed with "F-".
BLOCK TIME FORMAT: The block time file has the following format:
erc20tokens.json FILE: This file contains the list of popular ERC20 token contracts whose transfer/transferFrom transactions appear in the data files.
ERC20 token list: USDT TRYb XAUt BNB LEO LINK HT HEDG MKR CRO VEN INO PAX INB SNX REP MOF ZRX SXP OKB XIN OMG SAI HOT DAI EURS HPT BUSD USDC SUSD HDG QCAD PLUS BTCB WBTC cWBTC renBTC sBTC imBTC pBTC
IMPORTANT NOTE: Public Ethereum Mainnet blockchain data is open and can be obtained by connecting as a node on the blockchain or by using block explorer web sites such as http://etherscan.io . Downloaders and users of this dataset accept full responsibility for using the data in a GDPR-compliant manner or in compliance with any other regulations. We provide the data as is and cannot be held responsible for anything.
NOTE: If you use this dataset, please do not forget to add the DOI number to the citation. If you use our dataset in your research, please also cite our paper: https://link.springer.com/article/10.1007/s10586-021-03511-0
@article{kilic2022parallel,
  title = {Parallel Analysis of Ethereum Blockchain Transaction Data using Cluster Computing},
  journal = {Cluster Computing},
  author = {K{\i}l{\i}{\c{c}}, Baran and {\"O}zturan, Can and {\c{S}}en, Alper},
  year = {2022},
  month = {Jan}
}
Open Data Commons Attribution License (ODC-By) v1.0: https://www.opendatacommons.org/licenses/by/1.0/
License information was derived automatically
This is the Microsoft Academic Graph data from 2019-03-02. To get this, you'd normally jump through these hoops: https://docs.microsoft.com/en-us/academic-services/graph/get-started-setup-provisioning
As required by ODC-BY, I acknowledge Microsoft Academic using the URI https://aka.ms/msracad.
You can find out more about the data schema of the Microsoft Academic Graph at: https://docs.microsoft.com/en-us/academic-services/graph/reference-data-schema
Since Microsoft docs are covered by different licensing terms, the documentation cannot be provided along with the data.
There were no changes to the files except compressing them with gzip. They were downloaded and checked twice.
After uploading, the md5 hashes reported by Zenodo matched those of the locally created compressed files.
The compressed files will expand to the following sizes (in bytes):
4563254 Affiliations.txt
16498013834 Authors.txt
2220754 ConferenceInstances.txt
427502 ConferenceSeries.txt
55232571 FieldsOfStudy.txt
5685746 Journals.txt
32387110344 PaperAuthorAffiliations.txt
32326282060 PaperReferences.txt
7763965 PaperResources.txt
60135810372 Papers.txt
22000508908 PaperUrls.txt
--------------------------------
163423619310 total (~152GiB)
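The md5 check and expanded-size figures above can be reproduced with the standard library. The sketch below round-trips a tiny in-memory blob (the file content is made up) rather than the multi-gigabyte originals.

```python
import gzip
import hashlib

# Compute the md5 of a compressed blob and its uncompressed size,
# mirroring the verification described above. mtime=0 keeps the
# gzip output deterministic across runs.
def md5_and_expanded_size(data):
    return hashlib.md5(data).hexdigest(), len(gzip.decompress(data))

blob = gzip.compress(b"PaperId\tTitle\n1\tExample\n", mtime=0)
digest, size = md5_and_expanded_size(blob)
print(digest, size)  # size is 24 bytes uncompressed
```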
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description. The NetVote dataset contains the outputs of the NetVote program when applied to voting data coming from VoteWatch (http://www.votewatch.eu/).
These results were used in the following conference papers:
Source code. The NetVote source code is available on GitHub: https://github.com/CompNet/NetVotes.
Citation. If you use our dataset or tool, please cite article [1] above.
@InProceedings{Mendonca2015,
author = {Mendonça, Israel and Figueiredo, Rosa and Labatut, Vincent and Michelon, Philippe},
title = {Relevance of Negative Links in Graph Partitioning: A Case Study Using Votes From the {E}uropean {P}arliament},
booktitle = {2\textsuperscript{nd} European Network Intelligence Conference ({ENIC})},
year = {2015},
pages = {122-129},
address = {Karlskrona, SE},
publisher = {IEEE Publishing},
doi = {10.1109/ENIC.2015.25},
}
-------------------------
Details. This archive contains the following folders:
-------------------------
License. These data are shared under a Creative Commons 0 license.
Contact. Vincent Labatut <vincent.labatut@univ-avignon.fr> & Rosa Figueiredo <rosa.figueiredo@univ-avignon.fr>
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description. This is the data used in the experiment of the following conference paper:
N. Arınık, R. Figueiredo, and V. Labatut, “Signed Graph Analysis for the Interpretation of Voting Behavior,” in International Conference on Knowledge Technologies and Data-driven Business - International Workshop on Social Network Analysis and Digital Humanities, Graz, AT, 2017, vol. 2025. ⟨hal-01583133⟩
Source code. The code source is accessible on GitHub: https://github.com/CompNet/NetVotes
Citation. If you use the data or source code, please cite the above paper.
@InProceedings{Arinik2017,
author = {Arınık, Nejat and Figueiredo, Rosa and Labatut, Vincent},
title = {Signed Graph Analysis for the Interpretation of Voting Behavior},
booktitle = {International Conference on Knowledge Technologies and Data-driven Business - International Workshop on Social Network Analysis and Digital Humanities},
year = {2017},
volume = {2025},
series = {CEUR Workshop Proceedings},
address = {Graz, AT},
url = {http://ceur-ws.org/Vol-2025/paper_rssna_1.pdf},
}
Details.
----------------------
# COMPARISON RESULTS
The 'material-stats' folder contains all the comparison results obtained for Ex-CC and ILS-CC. The csv files associated with the plots are also provided. The folder structure is as follows:
* material-stats/
** execTimePerf: The plot shows the execution time of Ex-CC and ILS-CC on randomly generated complete networks of different sizes.
** graphStructureAnalysis: The plots show the weight and link statistics for all instances.
** ILS-CC-vs-Ex-CC: The folder contains 4 different comparisons between Ex-CC and ILS-CC: imbalance difference, number of detected clusters, difference in the number of detected clusters, and NMI (Normalized Mutual Information).
----------------------
Funding: Agorantic FR 3621, FMJH Program Gaspard Monge in optimization and operations research (Project 2015-2842H)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Leading Indicators OECD: Reference Series: Gross Domestic Product: Original Series for the United States was 120.60270 Index 2010=1.00 in October of 2023, according to the United States Federal Reserve. Historically, Leading Indicators OECD: Reference Series: Gross Domestic Product: Original Series for the United States reached a record high of 120.60270 in October of 2023 and a record low of 11.55572 in July of 1947. Trading Economics provides the current actual value, an historical data chart and related indicators for Leading Indicators OECD: Reference Series: Gross Domestic Product: Original Series for the United States - last updated from the United States Federal Reserve on July of 2025.
https://fred.stlouisfed.org/legal/#copyright-public-domain
Graph and download economic data for Federal Debt: Total Public Debt (GFDEBTN) from Q1 1966 to Q1 2025 about public, debt, federal, government, and USA.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Scholarly figures are data visualizations like bar charts, pie charts, line graphs, maps, scatter plots, or similar figures. Text extraction from scholarly figures is useful in many application scenarios, since text in scholarly figures often contains information that is not present in the surrounding text. This dataset is a corpus of 121 scholarly figures from the economics domain for evaluating text extraction tools. We randomly extracted these figures from a corpus of 288,000 open access publications from EconBiz. The dataset covers a wide variety of scholarly figures, from bar charts to maps. We manually labeled the figures to create the gold standard.
We adjusted the provided gold standard to have a uniform format for all datasets. Each figure is accompanied by a TSV file (tab-separated values) where each entry corresponds to a text line which has the following structure:
X-coordinate of the center of the bounding box in pixel
Y-coordinate of the center of the bounding box in pixel
Width of the bounding box in pixel
Height of the bounding box in pixel
Rotation angle around its center in degree
Text inside the bounding box
In addition we provide the ground truth in JSON format. A schema file is included in each dataset as well. The dataset is accompanied with a ReadMe file with further information about the figures and their origin.
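The per-line TSV structure above can be parsed with the csv module. The two rows below are invented examples, with fields in the listed order (x, y, width, height, rotation angle, text).

```python
import csv
import io

# Invented sample rows following the documented column order.
sample = "120\t45\t80\t14\t0\tGDP growth\n310\t200\t60\t12\t90\tYear\n"

boxes = []
for row in csv.reader(io.StringIO(sample), delimiter="\t"):
    x, y, width, height, angle = map(float, row[:5])
    boxes.append({"x": x, "y": y, "width": width,
                  "height": height, "angle": angle, "text": row[5]})

print(boxes[0]["text"], boxes[1]["angle"])  # GDP growth 90.0
```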
If you use this dataset in your own work, please cite one of the papers in the references.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This ds-graph ("ds" stands for "dataset"; ds-graphs represent datasets by tissue sample and donor) represents this information for the Human Reference Atlas Cell Type Populations Effort (Börner et al. 2025). It provides sample registration information submitted by consortium members in single-cell atlasing efforts, including accurate sample sizes and positions (Bueckle et al. 2025). When combined with ref-organ data, this information helps create 3D visual tissue sample placements. Additionally, the sample information is linked to datasets from researchers' assay analyses that offer deeper insights into the tissue samples. It is the dataset graph for the Human Reference Atlaspop Universe and includes all datasets considered for Human Reference Atlaspop (not enriched).
CHOCOLATE is a benchmark for detecting and correcting factual inconsistency in generated chart captions. It consists of captions produced by six advanced models, which are categorized into three subsets:
LVLM: GPT-4V, Bard (before Gemini)
LLM-based Pipeline: DePlot + GPT-4
Fine-tuned Model: ChartT5, MatCha, UniChart
The charts are from two datasets: VisText and the Pew split of Chart-to-Text. In total, CHOCOLATE consists of 1,187 examples. Each instance in CHOCOLATE consists of a caption generated by one of the models and the annotations of the factual errors for each caption sentence.
Paper Information
Paper: https://arxiv.org/abs/2312.10160
Code: https://github.com/khuangaf/CHOCOLATE/
Project: https://khuangaf.github.io/CHOCOLATE
Citation If you use the CHOCOLATE dataset in your work, please kindly cite the paper using this BibTeX:
@misc{huang-etal-2023-do,
title = "Do LVLMs Understand Charts? Analyzing and Correcting Factual Errors in Chart Captioning",
author = "Huang, Kung-Hsiang and Zhou, Mingyang and Chan, Hou Pong and Fung, Yi R. and Wang, Zhenhailong and Zhang, Lingyu and Chang, Shih-Fu and Ji, Heng",
year = {2023},
eprint = {2312.10160},
archivePrefix = {arXiv},
primaryClass = {cs.CL}
}
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Leading Indicators OECD: Reference series: Gross Domestic Product (GDP): Normalised for the United States was 100.49120 Index in November of 2023, according to the United States Federal Reserve. Historically, Leading Indicators OECD: Reference series: Gross Domestic Product (GDP): Normalised for the United States reached a record high of 102.98160 in May of 1973 and a record low of 92.02608 in May of 2020. Trading Economics provides the current actual value, an historical data chart and related indicators for Leading Indicators OECD: Reference series: Gross Domestic Product (GDP): Normalised for the United States - last updated from the United States Federal Reserve on July of 2025.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Leading Indicators OECD: Reference series: Gross Domestic Product (GDP): Trend for the United States was 119.91160 Index in November of 2023, according to the United States Federal Reserve. Historically, Leading Indicators OECD: Reference series: Gross Domestic Product (GDP): Trend for the United States reached a record high of 119.91160 in November of 2023 and a record low of 11.29785 in February of 1947. Trading Economics provides the current actual value, an historical data chart and related indicators for Leading Indicators OECD: Reference series: Gross Domestic Product (GDP): Trend for the United States - last updated from the United States Federal Reserve on July of 2025.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The graph representation of the 3D Reference Organ for Larynx, Male dataset.