Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
https://snap.stanford.edu/data/loc-Brightkite.html
Dataset information
Brightkite (http://www.brightkite.com/) was once a location-based social
networking service provider where users shared their locations by
checking-in. The friendship network was collected using their public API,
and consists of 58,228 nodes and 214,078 edges. The network is originally
directed but we have constructed a network with undirected edges when there
is a friendship in both ways. We have also collected a total of 4,491,143
checkins of these users over the period of Apr. 2008 - Oct. 2010.
Dataset statistics
Nodes 58,228
Edges 214,078
Nodes in largest WCC 56739 (0.974)
Edges in largest WCC 212945 (0.995)
Nodes in largest SCC 56739 (0.974)
Edges in largest SCC 212945 (0.995)
Average clustering coefficient 0.1723
Number of triangles 494728
Fraction of closed triangles 0.03979
Diameter (longest shortest path) 16
90-percentile effective diameter 6
Checkins 4,491,143
Source (citation)
E. Cho, S. A. Myers, J. Leskovec. Friendship and Mobility: User Movement in
Location-Based Social Networks. ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining (KDD), 2011.
http://cs.stanford.edu/people/jure/pubs/mobile-kdd11.pdf
Files
File Description
loc-brightkite_edges.txt.gz Friendship network of Brightkite users
loc-brightkite_totalCheckins.txt.gz Time and location information of check-ins made by users
Example of check-in information
[user] [check-in time] [latitude] [longitude] [location id]
58186 2008-12-03T21:09:14Z 39.633321 -105.317215 ee8b88dea22411
58186 2008-11-30T22:30:12Z 39.633321 -105.317215 ee8b88dea22411
58186 2008-11-28T17:55:04Z -13.158333 -72.531389 e6e86be2a22411
58186 2008-11-26T17:08:25Z 39.633321 -105.317215 ee8b88dea22411
58187 2008-08-14T21:23:55Z 41.257924 -95.938081 4c2af967eb5df8
58187 2008-08-14T07:09:38Z 41.257924 -95.938081 4c2af967eb5df8
58187 2008-08-14T07:08:59Z 41.295474 -95.999814 f3bb9560a2532e
58187 2008-08-14T06:54:21Z 41.295474 -95.999814 f3bb9560a2532e
58188 2010-04-06T06:45:19Z 46.521389 14.854444 ddaa40aaa22411
58188 2008-12-30T15:30:08Z 46.522621 14.849618 58e12bc0d67e11
58189 2009-04-08T07:36:46Z 46.554722 15.646667 ddaf9c4ea22411
58190 2009-04-08T07:01:28Z 46.421389 15.869722 dd793f96a22411
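The check-in layout above can be read with a few lines of Python. This is a minimal sketch, assuming whitespace-separated fields in the order shown and UTC timestamps (the trailing Z); the names CheckIn and parse_checkin are made up for illustration and are not part of SNAP.

```python
from datetime import datetime, timezone
from typing import NamedTuple, Optional


class CheckIn(NamedTuple):
    user: int
    time: datetime
    latitude: float
    longitude: float
    location_id: str


def parse_checkin(line: str) -> Optional[CheckIn]:
    """Parse one check-in line; return None if it does not have 5 fields."""
    fields = line.split()
    if len(fields) != 5:
        return None
    user, stamp, lat, lon, loc = fields
    # Assumption: the trailing "Z" means the timestamp is in UTC.
    when = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    return CheckIn(int(user), when, float(lat), float(lon), loc)


# First line of the example above.
sample = "58186 2008-12-03T21:09:14Z 39.633321 -105.317215 ee8b88dea22411"
record = parse_checkin(sample)
```

Returning None for short lines, rather than raising, is a design choice that makes it easy to skip incomplete records while streaming the file.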
The SNAP data set is 0-based, with nodes numbered 0 to 58,227.
In the SuiteSparse Matrix Collection the graph is converted to 1-based.
The Problem.A matrix is the undirected friendship network, where
A(i,j)=1 if person 1+i and person 1+j are friends in the SNAP data set.
There are 4,747,287 checkins in the loc-brightkite_totalCheckins.txt
file, but 6 lines are empty with a user id but no other data (those
are discarded here). In the SuiteSparse Matrix Collection, the checkin
data is held in 5 vectors of length 4,747,281. These are in the
Problem.aux component of the MATLAB struct. The kth entry of each of
these vectors holds the data in the kth line of the
loc-brightkite_totalCheckins.txt file (after deleting the 6 empty lines).
userid: the SNAP user id is an integer in the range 0 to 58,227. It
has been incremented by one, here, to reflect the corresponding
row and column of the Problem.A matrix. It contains 51,406
unique user id's.
checkin_time: a string of length 20
latitude: a double precision number
longitude: a double precision number
location_id: a string of length 61.
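The conversion described above (drop lines that carry only a user id, shift user ids by one, keep five parallel vectors) might look roughly like this in Python. It is only a sketch of the idea, not the script used to build the SuiteSparse files; the function name load_checkin_vectors and the sample lines are assumptions for illustration.

```python
def load_checkin_vectors(lines):
    """Build five parallel lists from check-in lines, skipping incomplete ones."""
    vectors = {
        "userid": [],
        "checkin_time": [],
        "latitude": [],
        "longitude": [],
        "location_id": [],
    }
    for line in lines:
        fields = line.split()
        if len(fields) != 5:
            # A line with a user id but no other data is discarded.
            continue
        user, stamp, lat, lon, loc = fields
        vectors["userid"].append(int(user) + 1)  # 0-based SNAP id -> 1-based row/column
        vectors["checkin_time"].append(stamp)
        vectors["latitude"].append(float(lat))
        vectors["longitude"].append(float(lon))
        vectors["location_id"].append(loc)
    return vectors


# Two real-looking lines plus one hypothetical "user id only" line.
sample_lines = [
    "58189 2009-04-08T07:36:46Z 46.554722 15.646667 ddaf9c4ea22411",
    "58190",
    "58190 2009-04-08T07:01:28Z 46.421389 15.869722 dd793f96a22411",
]
vectors = load_checkin_vectors(sample_lines)
```

After this step the kth entry of every list describes the same check-in, which mirrors the layout of the Problem.aux vectors.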
https://snap.stanford.edu/data/loc-Gowalla.html
Dataset information
Gowalla (http://www.gowalla.com/) is a location-based social networking
website where users share their locations by checking-in. The friendship
network is undirected and was collected using their public API, and
consists of 196,591 nodes and 950,327 edges. We have collected a total of
6,442,890 check-ins of these users over the period of Feb. 2009 - Oct.
2010.
Dataset statistics
Nodes 196,591
Edges 950,327
Nodes in largest WCC 196591 (1.000)
Edges in largest WCC 950327 (1.000)
Nodes in largest SCC 196591 (1.000)
Edges in largest SCC 950327 (1.000)
Average clustering coefficient 0.2367
Number of triangles 2273138
Fraction of closed triangles 0.007952
Diameter (longest shortest path) 14
90-percentile effective diameter 5.7
Check-ins 6,442,890
Source (citation)
E. Cho, S. A. Myers, J. Leskovec. Friendship and Mobility: User Movement in
Location-Based Social Networks. ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining (KDD), 2011.
http://cs.stanford.edu/people/jure/pubs/mobile-kdd11.pdf
Files
File Description
loc-gowalla_edges.txt.gz Friendship network of Gowalla users
loc-gowalla_totalCheckins.txt.gz Time and location information of check-ins made by users
Example of check-in information
[user] [check-in time] [latitude] [longitude] [location id]
196514 2010-07-24T13:45:06Z 53.3648119 -2.2723465833 145064
196514 2010-07-24T13:44:58Z 53.360511233 -2.276369017 1275991
196514 2010-07-24T13:44:46Z 53.3653895945 -2.2754087046 376497
196514 2010-07-24T13:44:38Z 53.3663709833 -2.2700764333 98503
196514 2010-07-24T13:44:26Z 53.3674087524 -2.2783813477 1043431
196514 2010-07-24T13:44:08Z 53.3675663377 -2.278631763 881734
196514 2010-07-24T13:43:18Z 53.3679640626 -2.2792943689 207763
196514 2010-07-24T13:41:10Z 53.364905 -2.270824 1042822
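Since each check-in carries a latitude and longitude, one common use is measuring how far a user moved between consecutive check-ins. The sketch below uses the standard haversine formula with a mean Earth radius of 6371.0088 km; this is an illustrative choice, not necessarily the distance measure used in the cited paper, and haversine_km is a made-up helper name.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; an assumption of this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# The last two check-ins of user 196514 in the example above.
step_km = haversine_km(53.364905, -2.270824, 53.3679640626, -2.2792943689)
```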
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset information
9 graphs of Autonomous Systems (AS) peering information inferred from Oregon
route-views between March 31 2001 and May 26 2001.
Dataset statistics are calculated for the graphs with the lowest (March 31 2001)
and highest (May 26 2001) number of nodes.
Dataset statistics for graph with lowest number of nodes - March 31 2001
Nodes 10670
Edges 22002
Nodes in largest WCC 10670 (1.000)
Edges in largest WCC 22002 (1.000)
Nodes in largest SCC 10670 (1.000)
Edges in largest SCC 22002 (1.000)
Average clustering coefficient 0.4559
Number of triangles 17144
Fraction of closed triangles 0.009306
Diameter (longest shortest path) 9
90-percentile effective diameter 4.5
Dataset statistics for graph with highest number of nodes - May 26 2001
Nodes 11174
Edges 23409
Nodes in largest WCC 11174 (1.000)
Edges in largest WCC 23409 (1.000)
Nodes in largest SCC 11174 (1.000)
Edges in largest SCC 23409 (1.000)
Average clustering coefficient 0.4532
Number of triangles 19894
Fraction of closed triangles 0.009636
Diameter (longest shortest path) 10
90-percentile effective diameter 4.4
Source (citation)
J. Leskovec, J. Kleinberg and C. Faloutsos. Graphs over Time: Densification
Laws, Shrinking Diameters and Possible Explanations. ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining (KDD), 2005.
Files
File Description
* AS peering information inferred from Oregon route-views ...
oregon1_010331.txt.gz from March 31 2001
oregon1_010407.txt.gz from April 7 2001
oregon1_010414.txt.gz from April 14 2001
oregon1_010421.txt.gz from April 21 2001
oregon1_010428.txt.gz from April 28 2001
oregon1_010505.txt.gz from May 05 2001
oregon1_010512.txt.gz from May 12 2001
oregon1_010519.txt.gz from May 19 2001
oregon1_010526.txt.gz from May 26 2001
NOTE: for the UF Sparse Matrix Collection, the primary matrix in this problem
set (Problem.A) is the last matrix in the sequence, oregon1_010526, from May 26
2001.
The nodes are uniform across all graphs in the sequence in the UF collection.
That is, nodes do...
EMR data-mining code such as association rules for order recommendations and outcome predictions and order set evaluation
This project includes the following software/data packages:
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
https://snap.stanford.edu/data/soc-sign-bitcoin-alpha.html
Dataset information
This is a who-trusts-whom network of people who trade using Bitcoin on a
platform called Bitcoin Alpha (http://www.btcalpha.com/). Since Bitcoin
users are anonymous, there is a need to maintain a record of users'
reputation to prevent transactions with fraudulent and risky users. Members
of Bitcoin Alpha rate other members on a scale of -10 (total distrust) to
+10 (total trust) in steps of 1. This is the first explicit weighted signed
directed network available for research.
Dataset statistics
Nodes 3,783
Edges 24,186
Range of edge weight -10 to +10
Percentage of positive edges 93%
A similar network from another Bitcoin platform, Bitcoin OTC, is available at
https://snap.stanford.edu/data/soc-sign-bitcoinotc.html (and as
SNAP/bitcoin-otc in the SuiteSparse Matrix Collection).
Source (citation) Please cite the following paper if you use this dataset:
S. Kumar, F. Spezzano, V.S. Subrahmanian, C. Faloutsos. Edge Weight
Prediction in Weighted Signed Networks. IEEE International Conference on
Data Mining (ICDM), 2016.
http://cs.stanford.edu/~srijan/pubs/wsn-icdm16.pdf
The following BibTeX citation can be used:
@inproceedings{kumar2016edge,
title={Edge weight prediction in weighted signed networks},
author={Kumar, Srijan and Spezzano, Francesca and
Subrahmanian, VS and Faloutsos, Christos},
booktitle={Data Mining (ICDM), 2016 IEEE 16th Intl. Conf. on},
pages={221--230},
year={2016},
organization={IEEE}
}
The project webpage for this paper, along with its code to calculate two
signed network metrics---fairness and goodness---is available at
http://cs.umd.edu/~srijan/wsn/
Files
File Description
soc-sign-bitcoinalpha.csv.gz Weighted Signed Directed Bitcoin Alpha web of trust network
Data format
Each line has one rating with the following format:
SOURCE, TARGET, RATING, TIME
where
SOURCE: node id of source, i.e., rater
TARGET: node id of target, i.e., ratee
RATING: the source's rating for the target,
ranging from -10 to +10 in steps of 1
TIME: the time of the rating, measured as seconds since Epoch.
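A reader for the SOURCE, TARGET, RATING, TIME layout could be sketched as follows. It assumes a comma-separated file with no header row and treats TIME as seconds since the Unix epoch in UTC; the three sample rows are invented for illustration and are not taken from the dataset.

```python
import csv
import io
from datetime import datetime, timezone


def read_ratings(handle):
    """Return a list of (source, target, rating, time) tuples from CSV rows."""
    ratings = []
    for row in csv.reader(handle):
        if len(row) != 4:
            continue  # skip blank or malformed rows
        source, target, rating = int(row[0]), int(row[1]), int(row[2])
        # float() tolerates fractional timestamps as well as integer ones.
        when = datetime.fromtimestamp(float(row[3]), tz=timezone.utc)
        ratings.append((source, target, rating, when))
    return ratings


def positive_fraction(ratings):
    """Fraction of ratings that are strictly positive."""
    if not ratings:
        return 0.0
    return sum(1 for _, _, rating, _ in ratings if rating > 0) / len(ratings)


# Hypothetical rows, only to exercise the parser.
sample_csv = io.StringIO("1,2,10,1400000000\n2,3,-3,1400003600\n3,1,1,1400007200\n")
ratings = read_ratings(sample_csv)
share_positive = positive_fraction(ratings)
```

Run on the full file, positive_fraction gives a figure that can be compared with the "Percentage of positive edges" statistic listed above.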
Notes on inclusion into the Suite...
Many existing complex space systems have a significant amount of historical maintenance and problem databases that are stored in unstructured text forms. The problem that we address in this paper is the discovery of recurring anomalies and relationships between problem reports that may indicate larger systemic problems. We will illustrate our techniques on data from discrepancy reports regarding software anomalies in the Space Shuttle. These free text reports are written by a number of different people, thus the emphasis and wording vary considerably. With Mehran Sahami from Stanford University, I'm putting together a book on text mining called "Text Mining: Theory and Applications" to be published by Taylor and Francis.
This collection includes the text and image data underlying Alexander Street Press' North American Indian Drama, 2nd Edition. The collection contains 244 plays by American Indian, First Nation, and Pacific Islander playwrights of the 20th century, as well as issues of the Native Playwrights' Newsletter. The collection represents groups across the United States and Canada, including Cherokee, Métis, Creek, Choctaw, Pembina Chippewa, Ojibway, Lenape, Comanche, Cree, Navajo, Rappahannock, Hawaiian/Samoan, and others.
For a complete list of titles, please see INDR-2E_metadata_FINAL.xlsx (under Supporting files).
This deposit is Stanford Libraries’ local copy of North American Indian Drama, 2nd Edition, which may be used for data mining. If your research requires you to read the text to understand or analyze it, you may use the corresponding ProQuest database, available under Links.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The files are from three different datasets. One of the three datasets (SemEval) is downloaded from SemEval-2014, an international workshop on semantic evaluation held in Dublin, Ireland. Another is the dataset (Stanford) used by Marneffe et al. in their work on finding contradictions in text. The third is the PHEME RTE (Recognizing Textual Entailment) dataset. The attached dataset is annotated with four different types of contradictions. It also includes intermediate results and feature values from our work on detecting conflicting statements in text.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
https://snap.stanford.edu/data/sx-askubuntu.html
Dataset information
This is a temporal network of interactions on the stack exchange web site
Ask Ubuntu (http://askubuntu.com/). There are three different types of
interactions represented by a directed edge (u, v, t):
user u answered user v's question at time t (in the graph sx-askubuntu-a2q)
user u commented on user v's question at time t (in the graph sx-askubuntu-c2q)
user u commented on user v's answer at time t (in the graph sx-askubuntu-c2a)
The graph sx-askubuntu contains the union of these graphs. These graphs
were constructed from the Stack Exchange Data Dump. Node ID numbers
correspond to the 'OwnerUserId' tag in that data dump.
Dataset statistics (sx-askubuntu)
Nodes 159,316
Temporal Edges 964,437
Edges in static graph 596,933
Time span 2613 days
Dataset statistics (sx-askubuntu-a2q)
Nodes 137,517
Temporal Edges 280,102
Edges in static graph 262,106
Time span 2613 days
Dataset statistics (sx-askubuntu-c2q)
Nodes 79,155
Temporal Edges 327,513
Edges in static graph 198,852
Time span 2047 days
Dataset statistics (sx-askubuntu-c2a)
Nodes 75,555
Temporal Edges 356,822
Edges in static graph 178,210
Time span 2418 days
Source (citation)
Ashwin Paranjape, Austin R. Benson, and Jure Leskovec. "Motifs in Temporal
Networks." In Proceedings of the Tenth ACM International Conference on Web
Search and Data Mining, 2017.
Files
File Description
sx-askubuntu.txt.gz All interactions
sx-askubuntu-a2q.txt.gz Answers to questions
sx-askubuntu-c2q.txt.gz Comments to questions
sx-askubuntu-c2a.txt.gz Comments to answers
Data format
SRC DST UNIXTS
where edges are separated by a new line and
SRC: id of the source node (a user)
DST: id of the target node (a user)
UNIXTS: Unix timestamp (seconds since the epoch)
...
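The quantities in the statistics tables (nodes, temporal edges, edges in the static graph, time span) can be recomputed from the three-column format with a short script. The sketch below assumes that a static edge is a distinct directed (source, target) pair and that the time span is the difference between the largest and smallest timestamps expressed in days; both are assumptions, and the sample lines are invented.

```python
def summarize_temporal_edges(lines):
    """Summarize 'SRC DST UNIXTS' lines into basic temporal-graph statistics."""
    nodes, static_edges = set(), set()
    temporal_edges = 0
    first_ts = last_ts = None
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        src, dst, ts = (int(p) for p in parts)
        nodes.update((src, dst))
        static_edges.add((src, dst))  # assumption: direction matters
        temporal_edges += 1
        first_ts = ts if first_ts is None else min(first_ts, ts)
        last_ts = ts if last_ts is None else max(last_ts, ts)
    span_days = 0.0 if first_ts is None else (last_ts - first_ts) / 86400
    return {
        "nodes": len(nodes),
        "temporal_edges": temporal_edges,
        "static_edges": len(static_edges),
        "span_days": span_days,
    }


# Hypothetical interactions: user 1 reaches user 2 twice, user 2 reaches user 3 once.
sample_edges = ["1 2 0", "1 2 86400", "2 3 172800"]
summary = summarize_temporal_edges(sample_edges)
```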
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The graph of routers comprising the Internet can be organized into sub-graphs
called Autonomous Systems (AS). Each AS exchanges traffic flows with some
neighbors (peers). We can construct a communication network of who-talks-to-
whom from the BGP (Border Gateway Protocol) logs.
The data was collected from University of Oregon Route Views Project
(http://www.routeviews.org/) - Online data and reports. The dataset contains
735 daily instances which span an interval of 785 days from November 8 1997 to
January 2 2000. In contrast to citation networks, where nodes and edges only
get added (not deleted) over time, the AS dataset also exhibits both the
addition and deletion of the nodes and edges over time.
Dataset statistics are calculated for the graph with the highest number of
nodes and edges (dataset from January 02 2000):
Dataset statistics
Nodes 6474
Edges 13233
Nodes in largest WCC 6474 (1.000)
Edges in largest WCC 13233 (1.000)
Nodes in largest SCC 6474 (1.000)
Edges in largest SCC 13233 (1.000)
Average clustering coefficient 0.3913
Number of triangles 6584
Fraction of closed triangles 0.009591
Diameter (longest shortest path) 9
90-percentile effective diameter 4.6
Source (citation)
J. Leskovec, J. Kleinberg and C. Faloutsos. Graphs over Time: Densification
Laws, Shrinking Diameters and Possible Explanations. ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining (KDD), 2005.
Files
File Description
as20000102.txt.gz Autonomous Systems graph from January 02 2000
as.tar.gz 735 Autonomous Systems graphs from November 8 1997 to
January 02 2000
NOTE: In the UF collection, the primary matrix (Problem.A) is the
as20000102 matrix from January 02 2000 (the last graph in the sequence).
The nodes are uniform across all graphs in the sequence in the UF collection.
That is, nodes do not come and go. A node that is "gone" simply has no edges.
This is to allow comparisons across each node in the graphs.
Problem.aux.nodenames gives the node numbers of the original problem. So
row/column i in the matrix is always node number Problem.aux.nodenames(i) in
all the graphs.
Problem.aux.G{k} is the kth graph in the sequence.
Problem.aux.Gname(k,:) is the name of the kth graph.
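The idea of keeping nodes uniform across all graphs in a sequence can be illustrated in Python. This is a sketch of the concept only, not the code used to build the collection: it assumes each snapshot is a list of (node, node) pairs, orders the original node numbers by sorting, and uses 0-based indices where the MATLAB struct is 1-based. The node numbers in the sample are arbitrary.

```python
def align_snapshots(snapshots):
    """Map every snapshot onto one shared node index.

    Returns (nodenames, aligned), where nodenames[i] is the original node
    number of index i, and aligned[k] is snapshot k rewritten with indices.
    """
    nodenames = sorted({node for edges in snapshots for edge in edges for node in edge})
    index = {name: i for i, name in enumerate(nodenames)}
    aligned = [[(index[u], index[v]) for u, v in edges] for edges in snapshots]
    return nodenames, aligned


# Two hypothetical snapshots; node 701 has no edges in the second one.
snapshots = [[(701, 3356)], [(3356, 7018)]]
nodenames, aligned = align_snapshots(snapshots)
```

Because the index is shared, a node that is absent from one snapshot simply has no edges there, which is what allows per-node comparisons across the sequence.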
This collection contains the plain text transcripts of therapy sessions. These transcripts are sourced from Alexander Street Press' Counseling and Psychotherapy Transcripts: Volume II. The collection features a diverse set of clients, a wide range of presenting issues, and multiple therapeutic approaches. Content was recorded in 2012 or later, and the transcripts were generally released between 2013 and 2015. The collection adheres to the American Psychological Association's Ethics Guidelines for use and anonymity.
For a complete list of transcripts, please see CTRN Metadata_QA completed by WS 1.6.21.xlsx (under Supporting files).
This deposit is Stanford Libraries’ local copy of Counseling and Psychotherapy Transcripts: Volume II, which may be used for data mining. If your research requires you to read the text to understand or analyze it, you may use the corresponding ProQuest database, available under Links.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Boolean implications (if-then rules) provide a conceptually simple, uniform and highly scalable way to find associations between pairs of random variables. In this paper, we propose to use Boolean implications to find relationships between variables of different data types (mutation, copy number alteration, DNA methylation and gene expression) from the glioblastoma (GBM) and ovarian serous cystadenoma (OV) data sets from The Cancer Genome Atlas (TCGA). We find hundreds of thousands of Boolean implications from these data sets. A direct comparison of the relationships found by Boolean implications and those found by commonly used methods for mining associations show that existing methods would miss relationships found by Boolean implications. Furthermore, many relationships exposed by Boolean implications reflect important aspects of cancer biology. Examples of our findings include cis relationships between copy number alteration, DNA methylation and expression of genes, a new hierarchy of mutations and recurrent copy number alterations, loss-of-heterozygosity of well-known tumor suppressors, and the hypermethylation phenotype associated with IDH1 mutations in GBM. The Boolean implication results used in the paper can be accessed at http://crookneck.stanford.edu/microarray/TCGANetworks/.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is the result of a project to support policy making by providing insights on the availability and composition of education offer in four key digital domains: artificial intelligence, high performance computing, cybersecurity, and data science. Following a text mining methodology that captures the inclusion of advanced digital technologies in the programmes’ syllabus, we monitor the availability of masters’ programmes, bachelor’s programmes and short professional courses and study their characteristics. These include the scope or depth with which the digital content is taught (classified into broad or specialised), education fields in which digital technologies are embedded (e.g., Information and communication technologies, Business, administration and law), and the content areas covered by the programmes (e.g. robotics, machine learning). Also, we consider the overlap between the four domains, to identify complementarities and synergies in the academic offer of advanced digital technologies. The dataset covers yearly data, starting from the academic year 2019-2020 and ending in academic year 2023-24 (and will not be further updated). In order to provide comparison with other competing economies, the dataset covers the EU and its Member States plus six additional countries: the United Kingdom, Norway, Switzerland, Canada, the United States, and Australia. Results of the study have been used as reference in the European Artificial Intelligence Strategy, the White Paper on Artificial Intelligence – a European approach to excellence and trust, in the Stanford University’s Artificial Intelligence Index Report 2019 and 2021. These data have substantiated the assessment of the national Recovery and Resilience plans, and are used as input for the Digital Resilience Dashboard, among others.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset information
Internet topology graph. From traceroutes run daily in 2005 -
http://www.caida.org/tools/measurement/skitter. From several scattered sources
to million destinations. 1.7 million nodes, 11 million edges.
Dataset statistics
Nodes 1696415
Edges 11095298
Nodes in largest WCC 1694616 (0.999)
Edges in largest WCC 11094209 (1.000)
Nodes in largest SCC 1694616 (0.999)
Edges in largest SCC 11094209 (1.000)
Average clustering coefficient 0.2963
Number of triangles 28769868
Fraction of closed triangles 0.005387
Diameter (longest shortest path) 25
90-percentile effective diameter 5.9
Source (citation)
J. Leskovec, J. Kleinberg and C. Faloutsos. Graphs over Time: Densification
Laws, Shrinking Diameters and Possible Explanations. ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining (KDD), 2005.
Files
File Description
as-skitter.txt.gz AS from traceroutes run daily in 2005 by skitter
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains labeled, weighted networks of chemical-gene, gene-gene, gene-disease, and chemical-disease relationships based on single sentences in PubMed abstracts. All raw dependency paths are provided in addition to the labeled relationships.
PART I: Connects dependency paths to labels, or "themes". Each record contains a dependency path followed by its score for each theme, and indicators of whether or not the path is part of the flagship path set for each theme (meaning that it was manually reviewed and determined to reflect that theme). The themes themselves are listed below and are in our paper (reference below).
PART II: Connects sentences to dependency paths. It consists of sentences and associated metadata, entity pairs found in the sentences, and dependency paths connecting those entity pairs. Each record contains the following information:
The "with-themes.txt" files only contain dependency paths with corresponding theme assignments from Part I. The plain ".txt" files contain all dependency paths.
This release contains the annotated network for the September 15, 2019 version of PubTator. The version discussed in our paper, below, is an older one - from April 30, 2016. If you're interested in that network, it can be found in Version 1 of this repository. We will be releasing updated networks periodically, as the PubTator community continues to release new versions of named entity annotations for Medline each month or so.
------------------------------------------------------------------------------------
REFERENCES
Percha B, Altman RBA (2017) A global network of biomedical relationships derived from text. Bioinformatics, 34(15): 2614-2624.
Percha B, Altman RBA (2015) Learning the structure of biomedical relationships from unstructured text. PLoS Computational Biology, 11(7): e1004216.
This project depends on named entity annotations from the PubTator project:
https://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/PubTator/
Reference:
Wei CH et al., PubTator: a Web-based text mining tool for assisting Biocuration, Nucleic acids research, 2013, 41 (W1): W518-W522.
Dependency parsing was provided by the Stanford CoreNLP toolkit (version 3.9.1):
https://stanfordnlp.github.io/CoreNLP/index.html
Reference:
Manning, Christopher D., Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP Natural Language Processing Toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 55-60.
------------------------------------------------------------------------------------
THEMES
chemical-gene
(A+) agonism, activation
(A-) antagonism, blocking
(B) binding, ligand (esp. receptors)
(E+) increases expression/production
(E-) decreases expression/production
(E) affects expression/production (neutral)
(N) inhibits
gene-chemical
(O) transport, channels
(K) metabolism, pharmacokinetics
(Z) enzyme activity
chemical-disease
(T) treatment/therapy (including investigatory)
(C) inhibits cell growth (esp. cancers)
(Sa) side effect/adverse event
(Pr) prevents, suppresses
(Pa) alleviates, reduces
(J) role in disease pathogenesis
disease-chemical
(Mp) biomarkers (of disease progression)
gene-disease
(U) causal mutations
(Ud) mutations affecting disease course
(D) drug targets
(J) role in pathogenesis
(Te) possible therapeutic effect
(Y) polymorphisms alter risk
(G) promotes progression
disease-gene
(Md) biomarkers (diagnostic)
(X) overexpression in disease
(L) improper regulation linked to disease
gene-gene
(B) binding, ligand (esp. receptors)
(W) enhances response
(V+) activates, stimulates
(E+) increases expression/production
(E) affects expression/production (neutral)
(I) signaling pathway
(H) same protein or complex
(Rg) regulation
(Q) production by cell population
------------------------------------------------------------------------------------
FORMATTING NOTE
A few users have mentioned that the dependency paths in the "part-i" files are all lowercase text, whereas those in the "part-ii" files maintain the case of the original sentence. This complicates mapping between the two sets of files.
We kept the part-ii files in the same case as the original sentence to facilitate downstream debugging - it's easier to tell which words in a particular sentence are contributing to the dependency path if their original case is maintained. When working with the part-ii "with-themes" files, if you simply convert the dependency path to lowercase, it is guaranteed to match to one of the paths in the corresponding part-i file and you'll be able to get the theme scores.
Apologies for the additional complexity, and please reach out to us if you have any questions (see correspondence information in the Bioinformatics manuscript, above).
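The lowercase matching described in this note can be sketched as a dictionary lookup. The example assumes the part-i data has already been loaded into a dict keyed by the (lowercase) dependency path; the path strings, scores, and the helper name theme_scores_for are invented placeholders and do not reflect the real file layout.

```python
def theme_scores_for(path, part_i_scores):
    """Look up theme scores for a part-ii dependency path.

    part-ii paths keep the sentence's original case, while part-i paths are
    all lowercase, so the key is lowercased before the lookup.
    """
    return part_i_scores.get(path.lower())


# Hypothetical part-i table: lowercase path -> {theme: score}.
part_i_scores = {
    "start_entity|nsubj|inhibits inhibits|dobj|end_entity": {"N": 12.0, "A-": 3.0},
}

# Hypothetical part-ii path, capitalized as in the original sentence.
part_ii_path = "START_ENTITY|nsubj|Inhibits Inhibits|dobj|END_ENTITY"
scores = theme_scores_for(part_ii_path, part_i_scores)
```

For paths taken from the plain ".txt" part-ii files, which include paths without theme assignments, the lookup may return None.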
A standardized compilation of published marine boundaries from the NOAA Office of Coast Survey and the Minerals Management Service. Authoritative marine boundary data and information is available from the Agency of Responsibility, although often only at a regional-scale and in a number of differing and disparate formats. The NOAA Coastal Services Center, working in conjunction with the Agency of Responsibility through the FGDC Marine Boundary Working Group, has compiled and standardized several datasets to create a national-scale, standardized data set based on several OGC and FGDC standards. Such datasets help to alleviate the initial cost of data mining, collecting, and the processing necessary to utilize such datasets for a variety of uses. Not for use for scales greater than 1:25,000.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset information
Arxiv HEP-PH (high energy physics phenomenology) citation graph is from the
e-print arXiv and covers all the citations within a dataset of 34,546 papers
with 421,578 edges. If a paper i cites paper j, the graph contains a directed
edge from i to j. If a paper cites, or is cited by, a paper outside the
dataset, the graph does not contain any information about this.
The data covers papers in the period from January 1993 to April 2003 (124
months). It begins within a few months of the inception of the arXiv, and thus
represents essentially the complete history of its HEP-PH section.
The data was originally released as a part of 2003 KDD Cup.
Dataset statistics
Nodes 34546
Edges 421578
Nodes in largest WCC 34401 (0.996)
Edges in largest WCC 421485 (1.000)
Nodes in largest SCC 12711 (0.368)
Edges in largest SCC 139981 (0.332)
Average clustering coefficient 0.2962
Number of triangles 1276868
Fraction of closed triangles 0.1457
Diameter (longest shortest path) 12
90-percentile effective diameter 5
Source (citation)
J. Leskovec, J. Kleinberg and C. Faloutsos. Graphs over Time: Densification
Laws, Shrinking Diameters and Possible Explanations. ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining (KDD), 2005.
J. Gehrke, P. Ginsparg, J. M. Kleinberg. Overview of the 2003 KDD Cup. SIGKDD
Explorations 5(2): 149-151, 2003.
Files
File Description
cit-HepPh.txt.gz Paper citation network of Arxiv High Energy Physics category
cit-HepPh-dates.txt.gz Time of nodes (paper submission time to Arxiv)
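Because an edge from i to j means that paper i cites paper j, the number of citations a paper receives inside the dataset is its in-degree. The sketch below counts in-degrees from an edge list, assuming one "i j" pair per line and skipping lines that start with "#" (treated here as comments); the paper ids are made up.

```python
from collections import Counter


def citation_counts(lines):
    """Count how often each paper is cited (in-degree) in an 'i j' edge list."""
    cited_by = Counter()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # assumption: '#' marks a comment or header line
        citing, cited = line.split()[:2]
        cited_by[int(cited)] += 1
    return cited_by


# Hypothetical edges: papers 1 and 3 cite paper 2, and paper 2 cites paper 4.
sample_citations = ["# FromNodeId ToNodeId", "1 2", "3 2", "2 4"]
cited_by = citation_counts(sample_citations)
most_cited = cited_by.most_common(1)
```

Citations to or from papers outside the dataset are not present in the file, so these counts are lower bounds on a paper's true citation count.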
Dataset information
Arxiv HEP-TH (high energy physics theory) citation graph is from the e-print
arXiv and covers all the citations within a dataset of 27,770 papers with
352,807 edges. If a paper i cites paper j, the graph contains a directed edge
from i to j. If a paper cites, or is cited by, a paper outside the dataset, the
graph does not contain any information about this.
The data covers papers in the period from January 1993 to April 2003 (124
months). It begins within a few months of the inception of the arXiv, and thus
represents essentially the complete history of its HEP-TH section.
The data was originally released as a part of 2003 KDD Cup.
Dataset statistics
Nodes 27770
Edges 352807
Nodes in largest WCC 27400 (0.987) ...
https://creativecommons.org/publicdomain/zero/1.0/
Gowalla is a location-based social networking website where users share their locations by checking-in.
Time and location information of check-ins made by users.
This data set is available from https://snap.stanford.edu/data/loc-gowalla.html
E. Cho, S. A. Myers, J. Leskovec. Friendship and Mobility: User Movement in Location-Based Social Networks. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2011.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset information
467 million Twitter posts from 20 million users covering a 7 month period
from June 1 2009 to December 31 2009. We estimate this is about 20-30% of
all public tweets published on Twitter during the particular time frame.
For each public tweet the following information was available:
Author
Time
Content
We have no Twitter social graph (who-follows-whom graph) available. You can
find a copy of the graph at http://an.kaist.ac.kr/traces/WWW2010.html
(thanks to Haewoon Kwak, et al.).
Dataset statistics
Number of users 17,069,982
Number of tweets 476,553,560
Number of URLs 181,611,080
Number of Hashtags 49,293,684
Number of re-tweets 71,835,017
Source (citation)
J. Yang, J. Leskovec. Temporal Variation in Online Media. ACM Intl.
Conf. on Web Search and Data Mining (WSDM '11), 2011.
As per request from Twitter the data is no longer available.
What is Twitter, a Social Network or a News Media?
Haewoon Kwak (http://an.kaist.ac.kr/~haewoon),
Changhyun Lee (http://an.kaist.ac.kr/~chlee),
Hosung Park (http://an.kaist.ac.kr/~hosung),
and Sue Moon (http://an.kaist.ac.kr/~sbmoon)
Proceedings of the 19th International World Wide Web (WWW) Conference,
April 26-30, 2010, Raleigh NC (USA)
Twitter, a microblogging service less than three years old, commands more
than 41 million users as of July 2009 and is growing fast. Twitter users
tweet about any topic within the 140-character limit and follow others to
receive their tweets. The goal of this paper is to study the topological
characteristics of Twitter and its power as a new medium of information
sharing.
We have crawled the entire Twitter site and obtained 41.7 million user
profiles, 1.47 billion social relations, 4,262 trending topics, and 106
million tweets. In its follower-following topology analysis we have found a
non-power-law follower distribution, a short effective diameter, and low
reciprocity, which all mark a deviation from known characteristics of human
social networks [Newman03]. In order to identify influentials on
Twitter, we have ranked users by the number of followers and by PageRank
and found two rankings to be similar. Ranking by retweets differs from the
previous two rankings, indicating a gap in influence inferred from the
number of followers and that from the popularity of one's tweets. We have
analyzed the tweets of top trending topics and reported on their temporal
behavior and user participation. We have classified the trending topics
based on the active period and th...
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset contains 122 CAIDA AS graphs, from January 2004 to November 2007 - http://www.caida.org/data/active/as-relationships/ . Each file contains a full AS graph derived from a set of RouteViews BGP table snapshots.
Dataset statistics are calculated for the graph with the highest number of
nodes, the dataset from November 5, 2007.
Dataset statistics (graph from November 5, 2007)
Nodes 26475
Edges 106762
Nodes in largest WCC 26475 (1.000)
Edges in largest WCC 106762 (1.000)
Nodes in largest SCC 26475 (1.000)
Edges in largest SCC 106762 (1.000)
Average clustering coefficient 0.2082
Number of triangles 36365
Fraction of closed triangles 0.007319
Diameter (longest shortest path) 17
90-percentile effective diameter 4.6
Source (citation)
J. Leskovec, J. Kleinberg and C. Faloutsos. Graphs over Time: Densification
Laws, Shrinking Diameters and Possible Explanations. ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining (KDD), 2005.
Files
File Description
as-caida20071105.txt.gz CAIDA AS graph from November 5 2007
as-caida.tar.gz 122 CAIDA AS graphs from January 2004 to November 2007
NOTE for UF Sparse Matrix Collection: these graphs are weighted. In the
original SNAP data set, the edge weights are in the set {-1, 0, 1, 2}. Note
that "0" is an edge weight. This can be handled in the UF collection for the
primary sparse matrix in a Problem, but not when the matrices are in a sequence
in the Problem.aux MATLAB struct. The entries with zero edge weight would
become lost. To correct for this, the weights are modified by adding 2 to each
weight. This preserves the structure of the original graphs, so that edges
with weight zero are not lost. (A non-edge is not the same as an edge with
weight zero in this problem).
old weight -> new weight:
-1 -> 1
0 -> 2
1 -> 3
2 -> 4
So to obtain the original weights, subtract 2 from each entry.
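The shift described above can be sketched in a few lines of Python. This is only an illustration: the function names and the small edge list are made up for the example and are not part of the collection.

```python
# Shift used in the UF collection: stored weight = original weight + 2.
WEIGHT_SHIFT = 2

def to_stored(original_weight: int) -> int:
    """Map an original SNAP weight in {-1, 0, 1, 2} to its stored value."""
    return original_weight + WEIGHT_SHIFT

def to_original(stored_weight: int) -> int:
    """Recover the original SNAP weight by subtracting 2 from a stored value."""
    return stored_weight - WEIGHT_SHIFT

# Hypothetical edges as (row, column, stored weight) triples.
stored_edges = [(1, 2, 1), (1, 3, 2), (2, 3, 3), (3, 4, 4)]

# After subtracting 2, the edge (1, 3) has weight 0 but is still an explicit edge.
original_edges = [(i, j, to_original(w)) for i, j, w in stored_edges]
```

Because every stored weight is nonzero, a sparse format that drops explicit zeros keeps all edges; the zero weights reappear only after the subtraction.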
The primary sparse matrix for this problem is the as-caida20071105 matrix, or
Problem.aux.G{121}, the second-to-the-last graph in the sequence.
The nodes are uniform across all graphs in the sequence in the UF collection.
That is, nodes do not come and go. A node that is "gone" simply has no edges.
This is to allow comparisons across each node in the graphs.
Problem.aux.nodenames gives the node numbers of the original problem. So
row/column i in the matrix is always node number Problem.aux.nodenames(i) in
all the graphs.
Problem.aux.G{k} is the kth graph in the sequence.
Problem.aux.Gname(k,:) is the name of the kth graph.
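As a rough Python sketch of this indexing convention: the `nodenames` values and the two edge sets below are made-up stand-ins, not data from the collection, and the helper names are hypothetical. Indices are kept 1-based to mirror the MATLAB struct.

```python
# Hypothetical stand-in: nodenames[i - 1] is the original node number of row/column i.
nodenames = [10, 25, 31, 47]

# Two snapshots over the same fixed index set (1-based indices, as in MATLAB).
snapshots = [
    {(1, 2), (2, 3)},          # graph k = 1: index 4 has no edges
    {(1, 2), (2, 3), (3, 4)},  # graph k = 2: index 4 is now connected
]

def original_node(i: int) -> int:
    """Original node number for the 1-based row/column index i."""
    return nodenames[i - 1]

def gone_nodes(edges: set) -> list:
    """Original node numbers of the indices that have no edges in one snapshot."""
    touched = {i for edge in edges for i in edge}
    return [original_node(i) for i in range(1, len(nodenames) + 1) if i not in touched]

gone_per_snapshot = [gone_nodes(edges) for edges in snapshots]
```

Since the index set never changes, the same index refers to the same original node in every snapshot, which is what makes per-node comparisons across the sequence possible.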
https://creativecommons.org/publicdomain/zero/1.0/
This database contains a subset of the Memetracker dataset collected by SNAP.
The full Memetracker dataset has observations broken into months. Because of size considerations, however, this version consists of one-half of a month: the first 15 days of Memetracker observations from November 2008.
Memetracker tracks the quotes and phrases that appear most frequently over time across the entire online news spectrum. This makes it possible to see how different stories compete for news and blog coverage each day, and how certain stories persist while others fade quickly.
Overall Memetracker tracks more than 17 million different phrases and about 54% of the total phrase/quote mentions appear on blogs and 46% in news media.
This dataset was collected by the Stanford Network Analysis Project. Detailed information about the data and its analysis can be found at the website here.
An analysis of this dataset was published here:
J. Leskovec, L. Backstrom, J. Kleinberg. Meme-tracking and the Dynamics of the News Cycle. ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining, 2009.
The SQLite database contains three tables:
articles: 4,542,920 records, with the following fields:
quotes: 7,956,125 records, with the following fields:
links: 16,727,125 records, with the following fields:
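A minimal Python sketch for checking the record counts with the standard `sqlite3` module follows. Only the three table names come from the description above; the `count_rows` helper is hypothetical, and the in-memory database with single-column tables is a stand-in so the sketch runs without the real file (the actual fields are not reproduced here).

```python
import sqlite3

# Table names as listed in the description above.
TABLES = ("articles", "quotes", "links")

def count_rows(conn: sqlite3.Connection, table: str) -> int:
    """Return the number of records in one of the known tables."""
    if table not in TABLES:
        # Table names cannot be bound as SQL parameters, so restrict them explicitly.
        raise ValueError(f"unknown table: {table}")
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]

# Stand-in database: made-up single-column tables, not the real schema.
demo = sqlite3.connect(":memory:")
for name in TABLES:
    demo.execute(f"CREATE TABLE {name} (id INTEGER)")
demo.executemany("INSERT INTO quotes VALUES (?)", [(1,), (2,)])

counts = {name: count_rows(demo, name) for name in TABLES}
```

Against the real database, one would pass the path of the downloaded SQLite file to `sqlite3.connect` instead of `":memory:"` and compare the counts with the figures listed above.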
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
https://snap.stanford.edu/data/loc-Brightkite.html
Dataset information
Brightkite (http://www.brightkite.com/) was once a location-based social
networking service provider where users shared their locations by
checking-in. The friendship network was collected using their public API,
and consists of 58,228 nodes and 214,078 edges. The network is originally
directed but we have constructed a network with undirected edges when there
is a friendship in both ways. We have also collected a total of 4,491,143
checkins of these users over the period of Apr. 2008 - Oct. 2010.
Dataset statistics
Nodes 58,228
Edges 214,078
Nodes in largest WCC 56739 (0.974)
Edges in largest WCC 212945 (0.995)
Nodes in largest SCC 56739 (0.974)
Edges in largest SCC 212945 (0.995)
Average clustering coefficient 0.1723
Number of triangles 494728
Fraction of closed triangles 0.03979
Diameter (longest shortest path) 16
90-percentile effective diameter 6
Checkins 4,491,143
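As a quick consistency check, the values in parentheses appear to be fractions of the full graph (this reading is an assumption, not stated in the table). A short Python sketch reproduces them from the counts:

```python
# Counts taken from the statistics table above.
nodes, edges = 58228, 214078
wcc_nodes, wcc_edges = 56739, 212945

# Assumed meaning of the parenthesized values: share of all nodes / all edges.
node_fraction = round(wcc_nodes / nodes, 3)
edge_fraction = round(wcc_edges / edges, 3)
```

Under that assumption the rounded values match the table (0.974 and 0.995).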
Source (citation)
E. Cho, S. A. Myers, J. Leskovec. Friendship and Mobility: User Movement
in Location-Based Social Networks. ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining (KDD), 2011.
http://cs.stanford.edu/people/jure/pubs/mobile-kdd11.pdf
Files
File Description
loc-brightkite_edges.txt.gz Friendship network of Brightkite users
loc-brightkite_totalCheckins.txt.gz Time and location information of check-ins made by users
Example of check-in information
[user] [check-in time] [latitude] [longitude] [location id]
58186 2008-12-03T21:09:14Z 39.633321 -105.317215 ee8b88dea22411
58186 2008-11-30T22:30:12Z 39.633321 -105.317215 ee8b88dea22411
58186 2008-11-28T17:55:04Z -13.158333 -72.531389 e6e86be2a22411
58186 2008-11-26T17:08:25Z 39.633321 -105.317215 ee8b88dea22411
58187 2008-08-14T21:23:55Z 41.257924 -95.938081 4c2af967eb5df8
58187 2008-08-14T07:09:38Z 41.257924 -95.938081 4c2af967eb5df8
58187 2008-08-14T07:08:59Z 41.295474 -95.999814 f3bb9560a2532e
58187 2008-08-14T06:54:21Z 41.295474 -95.999814 f3bb9560a2532e
58188 2010-04-06T06:45:19Z 46.521389 14.854444 ddaa40aaa22411
58188 2008-12-30T15:30:08Z 46.522621 14.849618 58e12bc0d67e11
58189 2009-04-08T07:36:46Z 46.554722 15.646667 ddaf9c4ea22411
58190 2009-04-08T07:01:28Z 46.421389 15.869722 dd793f96a22411
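One way to read a line in this format is sketched below in Python. It assumes the five fields are separated by whitespace and that the timestamp is UTC (suggested by the trailing Z); the `CheckIn` type and `parse_checkin` function are names chosen for this example.

```python
from datetime import datetime, timezone
from typing import NamedTuple, Optional

class CheckIn(NamedTuple):
    user: int
    time: datetime
    latitude: float
    longitude: float
    location_id: str

def parse_checkin(line: str) -> Optional[CheckIn]:
    """Parse one check-in line; return None if it does not have five fields."""
    fields = line.split()
    if len(fields) != 5:
        return None
    user, stamp, lat, lon, loc = fields
    # Timestamps look like 2008-12-03T21:09:14Z; treat the trailing Z as UTC.
    when = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    return CheckIn(int(user), when, float(lat), float(lon), loc)

sample = "58186 2008-12-03T21:09:14Z 39.633321 -105.317215 ee8b88dea22411"
record = parse_checkin(sample)
```

The location id is kept as a string because it is an opaque identifier rather than a number.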
The SNAP data set is 0-based, with nodes numbered 0 to 58,227.
In the SuiteSparse Matrix Collection the graph is converted to 1-based.
The Problem.A matrix is the undirected friendship network, where
A(i,j)=1 if persons i-1 and j-1 are friends in the SNAP data set.
There are 4,747,287 checkins in the loc-brightkite_totalCheckins.txt
file, but 6 lines are empty with a user id but no other data (those
are discarded here). In the SuiteSparse Matrix Collection, the checkin
data is held in 5 vectors of length 4,747,281. These are in the
Problem.aux component of the MATLAB struct. The kth entry of each of
these vectors holds the data in the kth line of the
loc-brightkite_totalCheckins.txt file (after deleting the 6 empty lines).
userid: the SNAP user id is an integer in the range 0 to 58,227. It
has been incremented by one here to match the corresponding
row and column of the Problem.A matrix. It contains 51,406
unique user ids.
checkin_time: a string of length 20
latitude: a double precision number
longitude: a double precision number
location_id: a string of length 61.
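The layout described above (five parallel vectors, user ids shifted by one, lines without data dropped) can be sketched in Python as follows. This is not the script used to build the collection; the function name and the sample lines, including the made-up line that carries only a user id, are assumptions for illustration.

```python
from typing import Dict, Iterable, List

def build_checkin_vectors(lines: Iterable[str]) -> Dict[str, List]:
    """Build five parallel vectors from whitespace-separated check-in lines."""
    vectors: Dict[str, List] = {
        "userid": [], "checkin_time": [], "latitude": [],
        "longitude": [], "location_id": [],
    }
    for line in lines:
        fields = line.split()
        if len(fields) != 5:
            continue  # skips lines with a user id but no other data, and blank lines
        user, when, lat, lon, loc = fields
        vectors["userid"].append(int(user) + 1)  # 0-based SNAP id -> 1-based row/column
        vectors["checkin_time"].append(when)
        vectors["latitude"].append(float(lat))
        vectors["longitude"].append(float(lon))
        vectors["location_id"].append(loc)
    return vectors

sample_lines = [
    "58186 2008-12-03T21:09:14Z 39.633321 -105.317215 ee8b88dea22411",
    "58186",  # made-up example of a line with a user id but no other data
    "58187 2008-08-14T21:23:55Z 41.257924 -95.938081 4c2af967eb5df8",
]
vectors = build_checkin_vectors(sample_lines)
unique_users = len(set(vectors["userid"]))
```

The kth entries of the five lists together describe the kth retained line, which mirrors how the vectors in Problem.aux are aligned.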
https://snap.stanford.edu/data/loc-Gowalla.html
Dataset information
Gowalla (http://www.gowalla.com/) is a location-based social networking
website where users share their locations by checking-in. The friendship
network is undirected and was collected using their public API, and
consists of 196,591 nodes and 950,327 edges. We have collected a total of
6,442,890 check-ins of these users over the period of Feb. 2009 - Oct.
2010.
Dataset statistics
Nodes 196,591
Edges 950,327
Nodes in largest WCC 196591 (1.000)
Edges in largest WCC 950327 (1.000)
Nodes in largest SCC 196591 (1.000)
Edges in largest SCC 950327 (1.000)
Average clustering coefficient 0.2367
Number of triangles 2273138
Fraction of closed triangles 0.007952
Diameter (longest shortest path) 14
90-percentile effective diameter 5.7
Check-ins 6,442,890
Source (citation)
E. Cho, S. A. Myers, J. Leskovec. Friendship and Mobility: User Movement
in Location-Based Social Networks. ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining (KDD), 2011.
http://cs.stanford.edu/people/jure/pubs/mobile-kdd11.pdf
Files
File Description
loc-gowalla_edges.txt.gz Friendship network of Gowalla users
loc-gowalla_totalCheckins.txt.gz Time and location information of check-ins made by users
Example of check-in information
[user] [check-in time] [latitude] [longitude] [location id]
196514 2010-07-24T13:45:06Z 53.3648119 -2.2723465833 145064
196514 2010-07-24T13:44:58Z 53.360511233 -2.276369017 1275991
196514 2010-07-24T13:44:46Z 53.3653895945 -2.2754087046 376497
196514 2010-07-24T13:44:38Z 53.3663709833 -2.2700764333 98503
196514 2010-07-24T13:44:26Z 53.3674087524 -2.2783813477 1043431
196514 2010-07-24T13:44:08Z 53.3675663377 -2.278631763 881734
196514 2010-07-24T13:43:18Z 53.3679640626 -2.2792943689 207763
196514 2010-07-24T13:41:10Z 53.364905 -2.270824 1042822