20 datasets found

Location Graphs (SNAP)
kaggle.com
zip
Updated Dec 16, 2021
Cite
Subhajit Sahu (2021). Location Graphs (SNAP) [Dataset]. https://www.kaggle.com/wolfram77/graphs-snap-loc
Available download formats: zip (163822208 bytes)
Dataset updated
Dec 16, 2021
Authors
Subhajit Sahu
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
loc-Brightkite

https://snap.stanford.edu/data/loc-Brightkite.html

Dataset information

Brightkite (http://www.brightkite.com/) was once a location-based social
networking service provider where users shared their locations by
checking-in. The friendship network was collected using their public API, and consists of 58,228 nodes and 214,078 edges. The network is originally
directed but we have constructed a network with undirected edges when there is a friendship in both ways. We have also collected a total of 4,491,143
checkins of these users over the period of Apr. 2008 - Oct. 2010.
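The rule of keeping an undirected edge only when the friendship exists in both directions can be sketched as follows. This is a minimal illustration, not the SNAP authors' code; the function name `mutual_edges` and the toy ids are assumptions made for the example.

```python
def mutual_edges(directed_edges):
    """Return undirected edges (u, v), u < v, that appear in both directions."""
    directed = set(directed_edges)
    return sorted(
        (u, v) for (u, v) in directed
        if u < v and (v, u) in directed
    )

# Toy directed friendship list (hypothetical ids, not taken from the dataset).
example = [(0, 1), (1, 0), (1, 2), (2, 3), (3, 2)]
print(mutual_edges(example))  # [(0, 1), (2, 3)]
```

The one-way link (1, 2) is dropped because no matching (2, 1) exists.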

Dataset statistics
Nodes 58,228
Edges 214,078
Nodes in largest WCC 56739 (0.974)
Edges in largest WCC 212945 (0.995)
Nodes in largest SCC 56739 (0.974)
Edges in largest SCC 212945 (0.995)
Average clustering coefficient 0.1723
Number of triangles 494728
Fraction of closed triangles 0.03979
Diameter (longest shortest path) 16
90-percentile effective diameter 6
Checkins 4,491,143
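The parenthesised values in the WCC rows are fractions of the whole graph. A minimal sketch of how such a figure could be computed from an undirected edge list with a breadth-first search is given below; `largest_component_fraction` is a hypothetical helper, not part of any SNAP tooling, and it only counts nodes that appear in at least one edge.

```python
from collections import defaultdict, deque

def largest_component_fraction(edges):
    """Return (size of largest connected component, its share of all nodes)."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    seen = set()
    best = 0
    for start in adj:
        if start in seen:
            continue
        # Breadth-first search over one component.
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:
            node = queue.popleft()
            size += 1
            for neighbour in adj[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        best = max(best, size)
    return best, (best / len(adj) if adj else 0.0)

# Toy graph: one component of 3 nodes, one of 2 nodes.
print(largest_component_fraction([(1, 2), (2, 3), (4, 5)]))  # (3, 0.6)
```

The fractions in the table are consistent with this definition: 56739 / 58228 rounds to 0.974 and 212945 / 214078 rounds to 0.995.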

Source (citation)
E. Cho, S. A. Myers, J. Leskovec. Friendship and Mobility: User Movement in
Location-Based Social Networks. ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining (KDD), 2011.
http://cs.stanford.edu/people/jure/pubs/mobile-kdd11.pdf

Files
File Description
loc-brightkite_edges.txt.gz Friendship network of Brightkite users
loc-brightkite_totalCheckins.txt.gz Time and location information of check-ins made by users

Example of check-in information

[user] [check-in time] [latitude] [longitude] [location id]
58186 2008-12-03T21:09:14Z 39.633321 -105.317215 ee8b88dea22411
58186 2008-11-30T22:30:12Z 39.633321 -105.317215 ee8b88dea22411
58186 2008-11-28T17:55:04Z -13.158333 -72.531389 e6e86be2a22411
58186 2008-11-26T17:08:25Z 39.633321 -105.317215 ee8b88dea22411
58187 2008-08-14T21:23:55Z 41.257924 -95.938081 4c2af967eb5df8
58187 2008-08-14T07:09:38Z 41.257924 -95.938081 4c2af967eb5df8
58187 2008-08-14T07:08:59Z 41.295474 -95.999814 f3bb9560a2532e
58187 2008-08-14T06:54:21Z 41.295474 -95.999814 f3bb9560a2532e
58188 2010-04-06T06:45:19Z 46.521389 14.854444 ddaa40aaa22411
58188 2008-12-30T15:30:08Z 46.522621 14.849618 58e12bc0d67e11
58189 2009-04-08T07:36:46Z 46.554722 15.646667 ddaf9c4ea22411
58190 2009-04-08T07:01:28Z 46.421389 15.869722 dd793f96a22411
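A check-in record with the five fields shown above could be parsed along these lines. This is a sketch under the assumption that fields are separated by whitespace (spaces or tabs); the `CheckIn` type and `parse_checkin` function are illustrative names, not part of the dataset.

```python
from datetime import datetime, timezone
from typing import NamedTuple, Optional

class CheckIn(NamedTuple):
    user: int
    time: datetime
    latitude: float
    longitude: float
    location_id: str

def parse_checkin(line: str) -> Optional[CheckIn]:
    """Parse one check-in line; return None if it does not have five fields."""
    parts = line.split()
    if len(parts) != 5:
        return None
    user, stamp, lat, lon, loc = parts
    # Timestamps in the example look like 2008-12-03T21:09:14Z (UTC).
    when = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    return CheckIn(int(user), when, float(lat), float(lon), loc)

record = parse_checkin("58186 2008-12-03T21:09:14Z 39.633321 -105.317215 ee8b88dea22411")
print(record.user, record.time.isoformat(), record.location_id)
```

The location id is kept as a string because the Brightkite ids are hexadecimal-looking tokens rather than integers.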

Notes on inclusion into the SuiteSparse Matrix Collection, July 2018:

The SNAP data set is 0-based, with nodes numbered 0 to 58,227.

In the SuiteSparse Matrix Collection the graph is converted to 1-based.
The Problem.A matrix is the undirected friendship network, where
A(i+1,j+1)=1 if person i and person j are friends in the SNAP data set.

There are 4,747,287 checkins in the loc-brightkite_totalCheckins.txt
file, but 6 lines are empty with a user id but no other data (those
are discarded here). In the SuiteSparse Matrix Collection, the checkin
data is held in 5 vectors of length 4,747,281. These are in the
Problem.aux component of the MATLAB struct. The kth entry of each of
these vectors holds the data in the kth line of the
loc-brightkite_totalCheckins.txt file (after deleting the 6 empty lines).

userid: the SNAP user id is an integer in the range 0 to 58,227. It has been incremented by one, here, to reflect the corresponding row and column of the Problem.A matrix. It contains 51,406 unique user ids.
checkin_time: a string of length 20
latitude: a double precision number
longitude: a double precision number
location_id: a string of length 61
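The conversion described in these notes (skip lines that carry only a user id, shift user ids from 0-based to 1-based, and hold the data in five parallel vectors) might look roughly like this in Python. It is an illustrative sketch, not the SuiteSparse import script; the function name and the dictionary layout are assumptions.

```python
def build_aux_vectors(lines):
    """Build five parallel lists from check-in lines, mirroring Problem.aux."""
    aux = {
        "userid": [],
        "checkin_time": [],
        "latitude": [],
        "longitude": [],
        "location_id": [],
    }
    for line in lines:
        fields = line.split()
        if len(fields) < 5:
            continue  # a user id with no other data (or a blank line) is discarded
        aux["userid"].append(int(fields[0]) + 1)  # 0-based SNAP id -> 1-based row/column
        aux["checkin_time"].append(fields[1])
        aux["latitude"].append(float(fields[2]))
        aux["longitude"].append(float(fields[3]))
        aux["location_id"].append(fields[4])
    return aux

sample = [
    "0 2010-10-17T01:48:53Z 39.747652 -104.99251 abc123",  # hypothetical record
    "7",                                                    # incomplete line
]
print(build_aux_vectors(sample)["userid"])  # [1]
```

Entry k of every list then describes the kth retained line, which is the property the notes rely on.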

loc-Gowalla

https://snap.stanford.edu/data/loc-Gowalla.html

Dataset information

Gowalla (http://www.gowalla.com/) is a location-based social networking
website where users share their locations by checking-in. The friendship
network is undirected and was collected using their public API, and
consists of 196,591 nodes and 950,327 edges. We have collected a total of
6,442,890 check-ins of these users over the period of Feb. 2009 - Oct.
2010.

Dataset statistics
Nodes 196,591
Edges 950,327
Nodes in largest WCC 196591 (1.000)
Edges in largest WCC 950327 (1.000)
Nodes in largest SCC 196591 (1.000)
Edges in largest SCC 950327 (1.000)
Average clustering coefficient 0.2367
Number of triangles 2273138
Fraction of closed triangles 0.007952
Diameter (longest shortest path) 14
90-percentile effective diameter 5.7
Check-ins 6,442,890

Source (citation)
E. Cho, S. A. Myers, J. Leskovec. Friendship and Mobility: User Movement in
Location-Based Social Networks. ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining (KDD), 2011.
http://cs.stanford.edu/people/jure/pubs/mobile-kdd11.pdf

Files
File Description
loc-gowalla_edges.txt.gz Friendship network of Gowalla users
loc-gowalla_totalCheckins.txt.gz Time and location information of check-ins made by users

Example of check-in information

[user] [check-in time] [latitude] [longitude] [location id]
196514 2010-07-24T13:45:06Z 53.3648119 -2.2723465833 145064
196514 2010-07-24T13:44:58Z 53.360511233 -2.276369017 1275991
196514 2010-07-24T13:44:46Z 53.3653895945 -2.2754087046 376497
196514 2010-07-24T13:44:38Z 53.3663709833 -2.2700764333 98503
196514 2010-07-24T13:44:26Z 53.3674087524 -2.2783813477 1043431
196514 2010-07-24T13:44:08Z 53.3675663377 -2.278631763 881734
196514 2010-07-24T13:43:18Z 53.3679640626 -2.2792943689 207763
196514 2010-07-24T13:41:10Z 53.364905 -2.270824 1042822
Autonomous System Graphs (SNAP)
kaggle.com
zip
Updated Dec 16, 2021
+ more versions
Cite
Subhajit Sahu (2021). Autonomous System Graphs (SNAP) [Dataset]. https://www.kaggle.com/wolfram77/graphs-snap-as
Available download formats: zip (94677378 bytes)
Dataset updated
Dec 16, 2021
Authors
Subhajit Sahu
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Autonomous systems - Oregon-1

Dataset information

9 graphs of Autonomous Systems (AS) peering information inferred from Oregon
route-views between March 31 2001 and May 26 2001.

Dataset statistics are calculated for the graphs with the lowest (March 31 2001) and highest (May 26 2001) number of nodes.

Dataset statistics for graph with lowest number of nodes - March 31 2001

Nodes 10670
Edges 22002
Nodes in largest WCC 10670 (1.000)
Edges in largest WCC 22002 (1.000)
Nodes in largest SCC 10670 (1.000)
Edges in largest SCC 22002 (1.000)
Average clustering coefficient 0.4559
Number of triangles 17144
Fraction of closed triangles 0.009306
Diameter (longest shortest path) 9
90-percentile effective diameter 4.5

Dataset statistics for graph with highest number of nodes - May 26 2001

Nodes 11174
Edges 23409
Nodes in largest WCC 11174 (1.000)
Edges in largest WCC 23409 (1.000)
Nodes in largest SCC 11174 (1.000)
Edges in largest SCC 23409 (1.000)
Average clustering coefficient 0.4532
Number of triangles 19894
Fraction of closed triangles 0.009636
Diameter (longest shortest path) 10
90-percentile effective diameter 4.4

Source (citation)

J. Leskovec, J. Kleinberg and C. Faloutsos. Graphs over Time: Densification
Laws, Shrinking Diameters and Possible Explanations. ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining (KDD), 2005.

Files
File Description
* AS peering information inferred from Oregon route-views ...
oregon1_010331.txt.gz from March 31 2001
oregon1_010407.txt.gz from April 7 2001
oregon1_010414.txt.gz from April 14 2001
oregon1_010421.txt.gz from April 21 2001
oregon1_010428.txt.gz from April 28 2001
oregon1_010505.txt.gz from May 05 2001
oregon1_010512.txt.gz from May 12 2001
oregon1_010519.txt.gz from May 19 2001
oregon1_010526.txt.gz from May 26 2001

NOTE: for the UF Sparse Matrix Collection, the primary matrix in this problem
set (Problem.A) is the last matrix in the sequence, oregon1_010526, from May 26 2001.

The nodes are uniform across all graphs in the sequence in the UF collection.
That is, nodes do...
Electronic Medical Record Data-Mining
simtk.org
Updated Sep 26, 2017
Cite
Jonathan Chen (2017). Electronic Medical Record Data-Mining [Dataset]. https://simtk.org/frs/?group_id=892
Available download formats: data/images/video (5 MB), application/x-zip-compressed (1 MB), source code (1 MB)
Dataset updated
Sep 26, 2017
Dataset provided by
Stanford
Authors
Jonathan Chen
Description
EMR data-mining code, such as association rules for order recommendations, outcome predictions, and order set evaluation.

This project includes the following software/data packages:

Order Sets and Topic Models : Application code and support script to reproduce topic model and order set prediction evaluations as published in JAMIA 2016 manuscript.

ICU DNR : Data underlying paper: "Reversals and Limitations of High-Intensity, Life-Sustaining Treatments" regarding clinical factors associated with DNR and Comfort Care orders in the ICU

Item Association Code PSB 2016
Bitcoin Trust Weighted Signed Networks (SNAP)
kaggle.com
zip
Updated Jan 2, 2022
Cite
Subhajit Sahu (2022). Bitcoin Trust Weighted Signed Networks (SNAP) [Dataset]. https://www.kaggle.com/wolfram77/graphs-snap-soc-sign-bitcoin
Available download formats: zip (2209890 bytes)
Dataset updated
Jan 2, 2022
Authors
Subhajit Sahu
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Bitcoin Alpha trust weighted signed network

https://snap.stanford.edu/data/soc-sign-bitcoin-alpha.html

Dataset information

This is a who-trusts-whom network of people who trade using Bitcoin on a
platform called Bitcoin Alpha (http://www.btcalpha.com/). Since Bitcoin
users are anonymous, there is a need to maintain a record of users'
reputation to prevent transactions with fraudulent and risky users. Members
of Bitcoin Alpha rate other members on a scale of -10 (total distrust) to
+10 (total trust) in steps of 1. This is the first explicit weighted signed
directed network available for research.

Dataset statistics
Nodes 3,783
Edges 24,186
Range of edge weight -10 to +10
Percentage of positive edges 93%

A similar network from another Bitcoin platform, Bitcoin OTC, is available at
https://snap.stanford.edu/data/soc-sign-bitcoinotc.html (and as
SNAP/bitcoin-otc in the SuiteSparse Matrix Collection).

Source (citation)
Please cite the following paper if you use this dataset:
S. Kumar, F. Spezzano, V.S. Subrahmanian, C. Faloutsos. Edge Weight
Prediction in Weighted Signed Networks. IEEE International Conference on
Data Mining (ICDM), 2016.
http://cs.stanford.edu/~srijan/pubs/wsn-icdm16.pdf

The following BibTeX citation can be used:
@inproceedings{kumar2016edge,
title={Edge weight prediction in weighted signed networks},
author={Kumar, Srijan and Spezzano, Francesca and
Subrahmanian, VS and Faloutsos, Christos},
booktitle={Data Mining (ICDM), 2016 IEEE 16th Intl. Conf. on},
pages={221--230},
year={2016},
organization={IEEE}
}

The project webpage for this paper, along with its code to calculate two
signed network metrics---fairness and goodness---is available at
http://cs.umd.edu/~srijan/wsn/

Files
File Description
soc-sign-bitcoinalpha.csv.gz Weighted Signed Directed Bitcoin Alpha web of trust network

Data format
Each line has one rating with the following format:

SOURCE, TARGET, RATING, TIME

where

SOURCE: node id of source, i.e., rater
TARGET: node id of target, i.e., ratee
RATING: the source's rating for the target, ranging from -10 to +10 in steps of 1
TIME: the time of the rating, measured as seconds since Epoch.
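Reading records in the SOURCE, TARGET, RATING, TIME layout described above could be sketched as follows. The sample rows are invented for illustration, and the helper names `read_ratings` and `positive_fraction` are assumptions rather than anything shipped with the dataset; the sketch also assumes the file has no header row.

```python
import csv
import io
from datetime import datetime, timezone

def read_ratings(text_stream):
    """Parse SOURCE, TARGET, RATING, TIME rows into typed tuples."""
    ratings = []
    for row in csv.reader(text_stream):
        if not row:
            continue
        source, target, rating, time = (field.strip() for field in row)
        rating = int(rating)
        if not -10 <= rating <= 10:
            raise ValueError(f"rating out of range: {rating}")
        # TIME is seconds since the epoch; float() accepts "1407470400" and "1407470400.0".
        when = datetime.fromtimestamp(float(time), tz=timezone.utc)
        ratings.append((int(source), int(target), rating, when))
    return ratings

def positive_fraction(ratings):
    """Share of ratings that are strictly positive."""
    return sum(1 for r in ratings if r[2] > 0) / len(ratings)

# Hypothetical rows, not taken from the dataset.
sample = io.StringIO("1,2,4,1407470400\n2,1,-10,1407470401\n")
rows = read_ratings(sample)
print(len(rows), positive_fraction(rows))  # 2 0.5
```

To read the compressed file directly, the same function could be fed `gzip.open(path, "rt")` from the standard library.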

Notes on inclusion into the Suite...
Anomaly Detection with Text Mining
catalog.data.gov
s.cnmilf.com
+2more
Updated Apr 11, 2025
+ more versions
Cite
Dashlink (2025). Anomaly Detection with Text Mining [Dataset]. https://catalog.data.gov/dataset/anomaly-detection-with-text-mining
Dataset updated
Apr 11, 2025
Dataset provided by
Dashlink
Description
Many existing complex space systems have a significant number of historical maintenance and problem databases that are stored in unstructured text form. The problem that we address in this paper is the discovery of recurring anomalies and relationships between problem reports that may indicate larger systemic problems. We will illustrate our techniques on data from discrepancy reports regarding software anomalies in the Space Shuttle. These free text reports are written by a number of different people, thus the emphasis and wording vary considerably. With Mehran Sahami from Stanford University, I'm putting together a book on text mining called "Text Mining: Theory and Applications" to be published by Taylor and Francis.
North American Indian Drama, 2nd Edition [full text data]
stanford.redivis.com
redivis.com
application/jsonl +7
Updated Feb 18, 2023
Cite
Stanford University Libraries (2023). North American Indian Drama, 2nd Edition [full text data] [Dataset]. http://doi.org/10.57761/wqj6-jm04
Available download formats: arrow, parquet, application/jsonl, sas, csv, stata, avro, spss
Unique identifier
https://doi.org/10.57761/wqj6-jm04
Dataset updated
Feb 18, 2023
Dataset provided by
Redivis Inc.
Authors
Stanford University Libraries
Time period covered
Oct 27, 2022 - Dec 16, 2022
Description
Abstract

This collection includes the text and image data underlying Alexander Street Press' North American Indian Drama, 2nd Edition. The collection contains 244 plays by American Indian, First Nation, and Pacific Islander playwrights of the 20th century, as well as issues of the Native Playwrights' Newsletter. The collection represents groups across the United States and Canada, including Cherokee, Métis, Creek, Choctaw, Pembina Chippewa, Ojibway, Lenape, Comanche, Cree, Navajo, Rappahannock, Hawaiian/Samoan, and others.

Usage

For a complete list of titles, please see INDR-2E_metadata_FINAL.xlsx (under Supporting files).

This deposit is Stanford Libraries’ local copy of North American Indian Drama, 2nd Edition, which may be used for data mining. If your research requires you to read the text to understand or analyze it, you may use the corresponding ProQuest database, available under Links.
Dataset for Conflicting Statements Detection in Text
figshare.com
zip
Updated Feb 9, 2018
Cite
Vijay Lingam; Simran Bhuria; Mayukh Nair; Divij Gurpreetsingh; Anjali Goyal; Ashish Sureka (2018). Dataset for Conflicting Statements Detection in Text [Dataset]. http://doi.org/10.6084/m9.figshare.5873823.v1
Available download formats: zip
Unique identifier
https://doi.org/10.6084/m9.figshare.5873823.v1
Dataset updated
Feb 9, 2018
Dataset provided by
Figshare, http://figshare.com/
Authors
Vijay Lingam; Simran Bhuria; Mayukh Nair; Divij Gurpreetsingh; Anjali Goyal; Ashish Sureka
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The files come from three different datasets. The first (SemEval) was downloaded from SemEval-2014, an international workshop on semantic evaluation held in Dublin, Ireland. The second (Stanford) is the dataset used by Marneffe et al. in their work on finding contradictions in text. The third is the PHEME RTE (Recognizing Textual Entailment) dataset. The attached data is annotated with four different types of contradictions and includes intermediate results and feature values from our work on detecting conflicting statements in text.
Stack Exchange Graphs (SNAP)
kaggle.com
zip
Updated Dec 16, 2021
Cite
Subhajit Sahu (2021). Stack Exchange Graphs (SNAP) [Dataset]. https://www.kaggle.com/datasets/wolfram77/graphs-snap-sx
Available download formats: zip (1480133729 bytes)
Dataset updated
Dec 16, 2021
Authors
Subhajit Sahu
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Ask Ubuntu temporal network

https://snap.stanford.edu/data/sx-askubuntu.html

Dataset information

This is a temporal network of interactions on the stack exchange web site
Ask Ubuntu (http://askubuntu.com/). There are three different types of
interactions represented by a directed edge (u, v, t):

user u answered user v's question at time t (in the graph sx-askubuntu-a2q)
user u commented on user v's question at time t (in the graph sx-askubuntu-c2q)
user u commented on user v's answer at time t (in the graph sx-askubuntu-c2a)

The graph sx-askubuntu contains the union of these graphs. These graphs
were constructed from the Stack Exchange Data Dump. Node ID numbers
correspond to the 'OwnerUserId' tag in that data dump.

Dataset statistics (sx-askubuntu)
Nodes 159,316
Temporal Edges 964,437
Edges in static graph 596,933
Time span 2613 days

Dataset statistics (sx-askubuntu-a2q)
Nodes 137,517
Temporal Edges 280,102
Edges in static graph 262,106
Time span 2613 days

Dataset statistics (sx-askubuntu-c2q)
Nodes 79,155
Temporal Edges 327,513
Edges in static graph 198,852
Time span 2047 days

Dataset statistics (sx-askubuntu-c2a)
Nodes 75,555
Temporal Edges 356,822
Edges in static graph 178,210
Time span 2418 days

Source (citation)
Ashwin Paranjape, Austin R. Benson, and Jure Leskovec. "Motifs in Temporal Networks." In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, 2017.

Files
File Description
sx-askubuntu.txt.gz All interactions
sx-askubuntu-a2q.txt.gz Answers to questions
sx-askubuntu-c2q.txt.gz Comments to questions
sx-askubuntu-c2a.txt.gz Comments to answers

Data format

SRC DST UNIXTS

where edges are separated by a new line and

SRC: id of the source node (a user)
DST: id of the target node (a user)
UNIXTS: Unix timestamp (seconds since the epoch) ...
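The difference between "Temporal Edges" and "Edges in static graph" in the statistics above can be illustrated with a short sketch that reads three-column lines and collapses repeated (source, target) pairs. The function name is hypothetical, and the time span is computed here as whole days between the first and last timestamp, which is an assumption about how the published figure was derived.

```python
def summarize_temporal(lines):
    """Count nodes, temporal edges, distinct directed edges, and time span in days."""
    temporal = 0
    static = set()
    nodes = set()
    first = last = None
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        src, dst, ts = map(int, parts)
        temporal += 1
        static.add((src, dst))  # repeated interactions collapse to one static edge
        nodes.update((src, dst))
        first = ts if first is None else min(first, ts)
        last = ts if last is None else max(last, ts)
    span_days = 0 if first is None else (last - first) // 86400
    return {
        "nodes": len(nodes),
        "temporal_edges": temporal,
        "static_edges": len(static),
        "span_days": span_days,
    }

# Hypothetical edges: user 1 interacts with user 2 twice, user 2 with user 1 once.
print(summarize_temporal(["1 2 0", "1 2 86400", "2 1 172800"]))
```

For the toy input this reports 2 nodes, 3 temporal edges, 2 static edges, and a span of 2 days.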
733 instances of Autonomous systems traffic (SNAP)
kaggle.com
zip
Updated Dec 16, 2021
Cite
Subhajit Sahu (2021). 733 instances of Autonomous systems traffic (SNAP) [Dataset]. https://www.kaggle.com/wolfram77/graphs-snap-as-735
Available download formats: zip (19603389 bytes)
Dataset updated
Dec 16, 2021
Authors
Subhajit Sahu
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The graph of routers comprising the Internet can be organized into sub-graphs
called Autonomous Systems (AS). Each AS exchanges traffic flows with some
neighbors (peers). We can construct a communication network of
who-talks-to-whom from the BGP (Border Gateway Protocol) logs.

The data was collected from University of Oregon Route Views Project
(http://www.routeviews.org/) - Online data and reports. The dataset contains
735 daily instances which span an interval of 785 days from November 8 1997 to January 2 2000. In contrast to citation networks, where nodes and edges only
get added (not deleted) over time, the AS dataset also exhibits both the
addition and deletion of the nodes and edges over time.
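Because edges are both added and deleted between daily instances, comparing two snapshots comes down to a pair of set differences. The sketch below assumes each snapshot is available as a list of undirected (u, v) pairs; `diff_snapshots` is an illustrative name, not part of the dataset.

```python
def normalize(edges):
    """Store each undirected edge once, with the smaller endpoint first."""
    return {tuple(sorted(edge)) for edge in edges}

def diff_snapshots(old_edges, new_edges):
    """Return (added, removed) undirected edges between two snapshots."""
    old, new = normalize(old_edges), normalize(new_edges)
    return sorted(new - old), sorted(old - new)

# Hypothetical AS numbers for two consecutive days.
day1 = [(1, 2), (2, 3)]
day2 = [(2, 1), (3, 4)]
added, removed = diff_snapshots(day1, day2)
print(added, removed)  # [(3, 4)] [(2, 3)]
```

Edge (1, 2) is unchanged even though its endpoints are listed in the opposite order on the second day.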

Dataset statistics are calculated for the graph with the highest number of
nodes and edges (dataset from January 02 2000):

Dataset statistics
Nodes 6474
Edges 13233
Nodes in largest WCC 6474 (1.000)
Edges in largest WCC 13233 (1.000)
Nodes in largest SCC 6474 (1.000)
Edges in largest SCC 13233 (1.000)
Average clustering coefficient 0.3913
Number of triangles 6584
Fraction of closed triangles 0.009591
Diameter (longest shortest path) 9
90-percentile effective diameter 4.6

Source (citation)

J. Leskovec, J. Kleinberg and C. Faloutsos. Graphs over Time: Densification
Laws, Shrinking Diameters and Possible Explanations. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2005.

Files
File Description
as20000102.txt.gz Autonomous Systems graph from January 02 2000
as.tar.gz 735 Autonomous Systems graphs from November 8 1997 to January 02 2000

NOTE: In the UF collection, the primary matrix (Problem.A) is the
as20000102 matrix from January 02 2000 (the last graph in the sequence).

The nodes are uniform across all graphs in the sequence in the UF collection. That is, nodes do not come and go. A node that is "gone" simply has no edges. This is to allow comparisons across each node in the graphs.
Problem.aux.nodenames gives the node numbers of the original problem. So
row/column i in the matrix is always node number Problem.aux.nodenames(i) in
all the graphs.

Problem.aux.G{k} is the kth graph in the sequence.
Problem.aux.Gname(k,:) is the name of the kth graph.
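The idea of a node set that stays fixed across all graphs, with a name table mapping each row/column back to the original node number, can be sketched in Python as follows. This is an illustration of the layout described above, not the UF collection's import code; note that Python indices are 0-based whereas the MATLAB struct is 1-based.

```python
def align_snapshots(snapshots):
    """Give every snapshot the same node index.

    Returns (nodenames, graphs) where nodenames[i] is the original node number
    of row/column i, and graphs[k][i] is the set of neighbour indices of node i
    in the kth snapshot. A node absent from a snapshot simply has no edges.
    """
    nodenames = sorted({node for edges in snapshots for edge in edges for node in edge})
    index = {name: i for i, name in enumerate(nodenames)}
    graphs = []
    for edges in snapshots:
        rows = [set() for _ in nodenames]
        for u, v in edges:
            rows[index[u]].add(index[v])
            rows[index[v]].add(index[u])
        graphs.append(rows)
    return nodenames, graphs

# Two hypothetical snapshots: node 30 is "gone" in the first, node 10 in the second.
names, graphs = align_snapshots([[(10, 20)], [(20, 30)]])
print(names)          # [10, 20, 30]
print(graphs[0][2])   # set()
```

Keeping one index for all snapshots is what makes per-node comparisons across the sequence straightforward.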
Counseling and Psychotherapy Transcripts: Volume II [full text data]
redivis.com
application/jsonl +7
Updated Feb 17, 2023
+ more versions
Cite
Stanford University Libraries (2023). Counseling and Psychotherapy Transcripts: Volume II [full text data] [Dataset]. http://doi.org/10.57761/zh3g-ch31
Available download formats: application/jsonl, arrow, spss, avro, stata, csv, parquet, sas
Unique identifier
https://doi.org/10.57761/zh3g-ch31
Dataset updated
Feb 17, 2023
Dataset provided by
Redivis Inc.
Authors
Stanford University Libraries
Time period covered
Feb 13, 2023
Description
Abstract

This collection contains the plain text transcripts of therapy sessions. These transcripts are sourced from Alexander Street Press' Counseling and Psychotherapy Transcripts: Volume II. The collection features a diverse set of clients, a wide range of presenting issues, and multiple therapeutic approaches. Content was recorded in 2012 or later, and the transcripts were generally released between 2013 and 2015. The collection adheres to the American Psychological Association's Ethics Guidelines for use and anonymity.

Usage

For a complete list of transcripts, please see CTRN Metadata_QA completed by WS 1.6.21.xlsx (under Supporting files).

This deposit is Stanford Libraries’ local copy of Counseling and Psychotherapy Transcripts: Volume II, which may be used for data mining. If your research requires you to read the text to understand or analyze it, you may use the corresponding ProQuest database, available under Links.
Mining TCGA Data Using Boolean Implications
plos.figshare.com
tiff
Updated Jun 4, 2023
Cite
Subarna Sinha; Emily K. Tsang; Haoyang Zeng; Michela Meister; David L. Dill (2023). Mining TCGA Data Using Boolean Implications [Dataset]. http://doi.org/10.1371/journal.pone.0102119
Available download formats: tiff
Unique identifier
https://doi.org/10.1371/journal.pone.0102119
Dataset updated
Jun 4, 2023
Dataset provided by
PLOS, http://plos.org/
Authors
Subarna Sinha; Emily K. Tsang; Haoyang Zeng; Michela Meister; David L. Dill
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Boolean implications (if-then rules) provide a conceptually simple, uniform and highly scalable way to find associations between pairs of random variables. In this paper, we propose to use Boolean implications to find relationships between variables of different data types (mutation, copy number alteration, DNA methylation and gene expression) from the glioblastoma (GBM) and ovarian serous cystadenoma (OV) data sets from The Cancer Genome Atlas (TCGA). We find hundreds of thousands of Boolean implications from these data sets. A direct comparison of the relationships found by Boolean implications and those found by commonly used methods for mining associations shows that existing methods would miss relationships found by Boolean implications. Furthermore, many relationships exposed by Boolean implications reflect important aspects of cancer biology. Examples of our findings include cis relationships between copy number alteration, DNA methylation and expression of genes, a new hierarchy of mutations and recurrent copy number alterations, loss-of-heterozygosity of well-known tumor suppressors, and the hypermethylation phenotype associated with IDH1 mutations in GBM. The Boolean implication results used in the paper can be accessed at http://crookneck.stanford.edu/microarray/TCGANetworks/.
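As a rough illustration of what an if-then rule between two binarised variables means, the sketch below tabulates the four high/low combinations and reports a rule when the combination that would contradict it is nearly empty. This is a deliberately simplified stand-in: the plain violation-rate threshold is an assumption made for the example and is not the statistical test used in the paper, and a rule will hold trivially if its antecedent almost never occurs.

```python
def boolean_implications(a, b, max_violation_rate=0.05):
    """List if-then rules between boolean vectors a and b.

    A rule holds when the quadrant that would contradict it contains at most
    max_violation_rate of all samples (a simplified, hypothetical criterion).
    """
    counts = {(x, y): 0 for x in (False, True) for y in (False, True)}
    for x, y in zip(a, b):
        counts[(bool(x), bool(y))] += 1
    total = sum(counts.values())
    # Each rule maps to the (A, B) combination that must be (almost) empty.
    rules = {
        "A high => B high": (True, False),
        "A high => B low": (True, True),
        "A low => B high": (False, False),
        "A low => B low": (False, True),
    }
    return [
        name for name, contradicting in rules.items()
        if total and counts[contradicting] / total <= max_violation_rate
    ]

# Toy data: whenever A is high, B is high too, but not the other way round.
a = [1, 1, 0, 0, 0, 1]
b = [1, 1, 1, 0, 1, 1]
print(boolean_implications(a, b))  # ['A high => B high']
```

Unlike a correlation coefficient, the rule is asymmetric: B can be high while A is low without contradicting it.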
Data from: Academic offer of advanced digital technologies
data.europa.eu
html, zip
Updated Jun 7, 2023
Cite
Joint Research Centre (2023). Academic offer of advanced digital technologies [Dataset]. https://data.europa.eu/data/datasets/7aed1a89-c904-43ed-af0f-b024fc9cb92a?locale=bg
Available download formats: zip, html
Dataset updated
Jun 7, 2023
Dataset authored and provided by
Joint Research Centre
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset is the result of a project to support policy making by providing insights on the availability and composition of education offer in four key digital domains: artificial intelligence, high performance computing, cybersecurity, and data science. Following a text mining methodology that captures the inclusion of advanced digital technologies in the programmes’ syllabus, we monitor the availability of masters’ programmes, bachelor’s programmes and short professional courses and study their characteristics. These include the scope or depth with which the digital content is taught (classified into broad or specialised), education fields in which digital technologies are embedded (e.g., Information and communication technologies, Business, administration and law), and the content areas covered by the programmes (e.g. robotics, machine learning). Also, we consider the overlap between the four domains, to identify complementarities and synergies in the academic offer of advanced digital technologies. The dataset covers yearly data, starting from the academic year 2019-2020 and ending in academic year 2023-24 (and will not be further updated). In order to provide comparison with other competing economies, the dataset covers the EU and its Member States plus six additional countries: the United Kingdom, Norway, Switzerland, Canada, the United States, and Australia. Results of the study have been used as reference in the European Artificial Intelligence Strategy, the White Paper on Artificial Intelligence – a European approach to excellence and trust, in the Stanford University’s Artificial Intelligence Index Report 2019 and 2021. These data have substantiated the assessment of the national Recovery and Resilience plans, and are used as input for the Digital Resilience Dashboard, among others.
Autonomous System Graphs by Skitter (SNAP)
kaggle.com
zip
Updated Dec 16, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Subhajit Sahu (2021). Autonomous System Graphs by Skitter (SNAP) [Dataset]. https://www.kaggle.com/datasets/wolfram77/graphs-snap-as-skitter
Available download formats: zip (33194649 bytes)
Dataset updated
Dec 16, 2021
Authors
Subhajit Sahu
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Autonomous systems by Skitter

Dataset information

Internet topology graph. From traceroutes run daily in 2005 -
http://www.caida.org/tools/measurement/skitter. From several scattered sources to million destinations. 1.7 million nodes, 11 million edges.

Dataset statistics
Nodes 1696415
Edges 11095298
Nodes in largest WCC 1694616 (0.999)
Edges in largest WCC 11094209 (1.000)
Nodes in largest SCC 1694616 (0.999)
Edges in largest SCC 11094209 (1.000)
Average clustering coefficient 0.2963
Number of triangles 28769868
Fraction of closed triangles 0.005387
Diameter (longest shortest path) 25
90-percentile effective diameter 5.9

Source (citation)

J. Leskovec, J. Kleinberg and C. Faloutsos. Graphs over Time: Densification
Laws, Shrinking Diameters and Possible Explanations. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2005.

Files
File Description
as-skitter.txt.gz AS from traceroutes run daily in 2005 by skitter
Data from: A global network of biomedical relationships derived from text
zenodo.org
data.niaid.nih.gov
application/gzip
Updated Jan 24, 2020
Cite
Bethany Percha; Russ B. Altman (2020). A global network of biomedical relationships derived from text [Dataset]. http://doi.org/10.5281/zenodo.3459420
Available download formats: application/gzip
Unique identifier
https://doi.org/10.5281/zenodo.3459420
Dataset updated
Jan 24, 2020
Dataset provided by
Zenodo, http://zenodo.org/
Authors
Bethany Percha; Russ B. Altman
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This repository contains labeled, weighted networks of chemical-gene, gene-gene, gene-disease, and chemical-disease relationships based on single sentences in PubMed abstracts. All raw dependency paths are provided in addition to the labeled relationships.

PART I: Connects dependency paths to labels, or "themes". Each record contains a dependency path followed by its score for each theme, and indicators of whether or not the path is part of the flagship path set for each theme (meaning that it was manually reviewed and determined to reflect that theme). The themes themselves are listed below and are in our paper (reference below).

PART II: Connects sentences to dependency paths. It consists of sentences and associated metadata, entity pairs found in the sentences, and dependency paths connecting those entity pairs. Each record contains the following information:

PubMed ID

Sentence number (0 = title)

First entity name, formatted

First entity name, location (characters from start of abstract)

Second entity name, formatted

Second entity name, location

First entity name, raw string

Second entity name, raw string

First entity name, database ID(s)

Second entity name, database ID(s)

First entity type (Chemical, Gene, Disease)

Second entity type (Chemical, Gene, Disease)

Dependency path

Sentence, tokenized

The "with-themes.txt" files only contain dependency paths with corresponding theme assignments from Part I. The plain ".txt" files contain all dependency paths.
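A Part II record with the fourteen fields listed above could be loaded into a typed structure along these lines. The sketch assumes one record per line with tab-separated fields in the listed order, which should be checked against the actual files; the class and field names are illustrative.

```python
from typing import NamedTuple

class PartIIRecord(NamedTuple):
    pubmed_id: str
    sentence_number: int      # 0 = title
    entity1_formatted: str
    entity1_location: str
    entity2_formatted: str
    entity2_location: str
    entity1_raw: str
    entity2_raw: str
    entity1_db_ids: str
    entity2_db_ids: str
    entity1_type: str         # Chemical, Gene, or Disease
    entity2_type: str
    dependency_path: str
    sentence_tokenized: str

def parse_part2_line(line: str) -> PartIIRecord:
    """Split one tab-separated line into a PartIIRecord."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(PartIIRecord._fields):
        raise ValueError(f"expected {len(PartIIRecord._fields)} fields, got {len(fields)}")
    fields[1] = int(fields[1])
    return PartIIRecord(*fields)

# Hypothetical record, invented for illustration.
row = ["12345", "0", "aspirin", "0,7", "PTGS2", "20,25", "Aspirin", "COX-2",
       "MESH:D001241", "5743", "Chemical", "Gene", "inhibits|nsubj|START", "Aspirin inhibits COX-2 ."]
record = parse_part2_line("\t".join(row) + "\n")
print(record.entity1_type, record.entity2_type, record.sentence_number)
```

Locations and database IDs are left as strings because their exact encoding is not described here.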

This release contains the annotated network for the September 15, 2019 version of PubTator. The version discussed in our paper, below, is an older one - from April 30, 2016. If you're interested in that network, it can be found in Version 1 of this repository. We will be releasing updated networks periodically, as the PubTator community continues to release new versions of named entity annotations for Medline each month or so.

------------------------------------------------------------------------------------
REFERENCES

Percha B, Altman RBA (2017) A global network of biomedical relationships derived from text. Bioinformatics, 34(15): 2614-2624.
Percha B, Altman RBA (2015) Learning the structure of biomedical relationships from unstructured text. PLoS Computational Biology, 11(7): e1004216.

This project depends on named entity annotations from the PubTator project:
https://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/PubTator/

Reference:
Wei CH et al., PubTator: a Web-based text mining tool for assisting biocuration, Nucleic Acids Research, 2013, 41(W1): W518-W522.

Dependency parsing was provided by the Stanford CoreNLP toolkit (version 3.9.1):
https://stanfordnlp.github.io/CoreNLP/index.html

Reference:
Manning, Christopher D., Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP Natural Language Processing Toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 55-60.

------------------------------------------------------------------------------------
THEMES

chemical-gene
(A+) agonism, activation
(A-) antagonism, blocking
(B) binding, ligand (esp. receptors)
(E+) increases expression/production
(E-) decreases expression/production
(E) affects expression/production (neutral)
(N) inhibits

gene-chemical
(O) transport, channels
(K) metabolism, pharmacokinetics
(Z) enzyme activity

chemical-disease
(T) treatment/therapy (including investigatory)
(C) inhibits cell growth (esp. cancers)
(Sa) side effect/adverse event
(Pr) prevents, suppresses
(Pa) alleviates, reduces
(J) role in disease pathogenesis

disease-chemical
(Mp) biomarkers (of disease progression)

gene-disease
(U) causal mutations
(Ud) mutations affecting disease course
(D) drug targets
(J) role in pathogenesis
(Te) possible therapeutic effect
(Y) polymorphisms alter risk
(G) promotes progression

disease-gene
(Md) biomarkers (diagnostic)
(X) overexpression in disease
(L) improper regulation linked to disease

gene-gene
(B) binding, ligand (esp. receptors)
(W) enhances response
(V+) activates, stimulates
(E+) increases expression/production
(E) affects expression/production (neutral)
(I) signaling pathway
(H) same protein or complex
(Rg) regulation
(Q) production by cell population

------------------------------------------------------------------------------------
FORMATTING NOTE

A few users have mentioned that the dependency paths in the "part-i" files are all lowercase text, whereas those in the "part-ii" files maintain the case of the original sentence. This complicates mapping between the two sets of files.

We kept the part-ii files in the same case as the original sentence to facilitate downstream debugging - it's easier to tell which words in a particular sentence are contributing to the dependency path if their original case is maintained. When working with the part-ii "with-themes" files, if you simply convert the dependency path to lowercase, it is guaranteed to match to one of the paths in the corresponding part-i file and you'll be able to get the theme scores.
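The lowercase matching described above can be sketched in a few lines of Python. This is only an illustration: the dependency path string, the theme scores, and the `lookup_theme_scores` helper are invented for the example, and the real files would first need to be parsed into a dictionary of this shape.

```python
def lookup_theme_scores(part_ii_path, part_i_scores):
    """Return the theme scores for a part-ii dependency path, or None if absent.

    part_i_scores maps a lowercase dependency path to a {theme: score} dict.
    """
    return part_i_scores.get(part_ii_path.lower())


# Hypothetical part-i content; real paths and scores come from the part-i file.
part_i_scores = {
    "inhibits|nsubj|start_entity inhibits|dobj|end_entity": {"N": 12.0, "A-": 3.5},
}

# Hypothetical part-ii path that keeps the case of the original sentence.
mixed_case_path = "Inhibits|nsubj|START_ENTITY Inhibits|dobj|END_ENTITY"

scores = lookup_theme_scores(mixed_case_path, part_i_scores)
print(scores)
```

Lowercasing at lookup time leaves the part-ii records untouched, so the original case remains available for debugging.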

Apologies for the additional complexity, and please reach out to us if you have any questions (see correspondence information in the Bioinformatics manuscript, above).
Marine Jurisdictions, Northeast United States, 2010
searchworks.stanford.edu
zip
Updated Nov 9, 2021
(2021). Marine Jurisdictions, Northeast United States, 2010 [Dataset]. https://searchworks.stanford.edu/view/zk816hk1203
Explore at:
Available download formats: zip
Dataset updated
Nov 9, 2021
Area covered
Northeastern United States, United States
Description
a standardized compilation of published marine boundaries from the NOAA Office of Coast Survey and the Minerals Management Service. Authoritative marine boundary data and information is available from the Agency of Responsibility, although often only at a regional-scale and in a number of differing and disparate formats. The NOAA Coastal Services Center, working in conjunction with the Agency of Responsibility through the FGDC Marine Boundary Working Group, has compiled and standardized several datasets to create a national-scale, standardized data set based on several OGC and FGDC standards. Such datasets help to alleviate the initial cost of data mining, collecting, and the processing necessary to utilize such datasets for a variety of uses. Not for use for scales greater than 1:25,000.
Citation Networks (SNAP)
kaggle.com
zip
Updated Dec 16, 2021
Subhajit Sahu (2021). Citation Networks (SNAP) [Dataset]. https://www.kaggle.com/datasets/wolfram77/graphs-snap-cit
Explore at:
Available download formats: zip (95620457 bytes)
Dataset updated
Dec 16, 2021
Authors
Subhajit Sahu
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
High-energy physics citation network

Dataset information

Arxiv HEP-PH (high energy physics phenomenology) citation graph is from the
e-print arXiv and covers all the citations within a dataset of 34,546 papers
with 421,578 edges. If a paper i cites paper j, the graph contains a directed
edge from i to j. If a paper cites, or is cited by, a paper outside the
dataset, the graph does not contain any information about this.
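As a rough illustration of the edge direction described above, the sketch below reads an edge list and counts how often each paper is cited. It assumes a plain-text file with one whitespace-separated `i j` pair per line and optional `#` comment lines; the paper ids are invented, and `read_edges` is a hypothetical helper rather than part of any SNAP tooling.

```python
from collections import Counter


def read_edges(lines):
    """Parse 'src dst' pairs, skipping blank lines and '#' comment lines."""
    edges = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        src, dst = line.split()[:2]
        edges.append((int(src), int(dst)))
    return edges


# Invented sample: paper 103 cites 102 and 101, paper 101 cites 102.
sample_lines = ["# FromNodeId ToNodeId", "101 102", "103 102", "103 101"]

edges = read_edges(sample_lines)
# An edge (i, j) means i cites j, so in-degree is the citation count of j.
citations_received = Counter(dst for _, dst in edges)
print(citations_received.most_common())
```

Citations to or from papers outside the dataset do not appear in the file, so these counts only reflect citations within it.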

The data covers papers in the period from January 1993 to April 2003 (124
months). It begins within a few months of the inception of the arXiv, and thus
represents essentially the complete history of its HEP-PH section.

The data was originally released as a part of 2003 KDD Cup.

Dataset statistics
Nodes 34546
Edges 421578
Nodes in largest WCC 34401 (0.996)
Edges in largest WCC 421485 (1.000)
Nodes in largest SCC 12711 (0.368)
Edges in largest SCC 139981 (0.332)
Average clustering coefficient 0.2962
Number of triangles 1276868
Fraction of closed triangles 0.1457
Diameter (longest shortest path) 12
90-percentile effective diameter 5

Source (citation)

J. Leskovec, J. Kleinberg and C. Faloutsos. Graphs over Time: Densification
Laws, Shrinking Diameters and Possible Explanations. ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining (KDD), 2005.

J. Gehrke, P. Ginsparg, J. M. Kleinberg. Overview of the 2003 KDD Cup. SIGKDD
Explorations 5(2): 149-151, 2003.

Files
File Description
cit-HepPh.txt.gz Paper citation network of Arxiv High Energy Physics category
cit-HepPh-dates.txt.gz Time of nodes (paper submission time to Arxiv)

High-energy physics theory citation network

Dataset information

Arxiv HEP-TH (high energy physics theory) citation graph is from the e-print
arXiv and covers all the citations within a dataset of 27,770 papers with
352,807 edges. If a paper i cites paper j, the graph contains a directed edge
from i to j. If a paper cites, or is cited by, a paper outside the dataset, the graph does not contain any information about this.

The data covers papers in the period from January 1993 to April 2003 (124
months). It begins within a few months of the inception of the arXiv, and thus represents essentially the complete history of its HEP-TH section.

The data was originally released as a part of 2003 KDD Cup.

Dataset statistics
Nodes 27770
Edges 352807
Nodes in largest WCC 27400 (0.987) ...
Gowalla Checkins
kaggle.com
zip
Updated Nov 15, 2017
bqlearner (2017). Gowalla Checkins [Dataset]. https://www.kaggle.com/bqlearner/gowalla-checkins
Explore at:
Available download formats: zip (105113346 bytes)
Dataset updated
Nov 15, 2017
Authors
bqlearner
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Context

Gowalla is a location-based social networking website where users share their locations by checking-in.

Content

Time and location information of check-ins made by users.

Acknowledgements

This data set is available from https://snap.stanford.edu/data/loc-gowalla.html

E. Cho, S. A. Myers, J. Leskovec. Friendship and Mobility: User Movement in Location-Based Social Networks. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2011.
Twitter Posts Network (SNAP)
kaggle.com
zip
Updated Dec 16, 2021
Subhajit Sahu (2021). Twitter Posts Network (SNAP) [Dataset]. https://www.kaggle.com/datasets/wolfram77/graphs-snap-twitter7/code
Explore at:
Available download formats: zip (6560110658 bytes)
Dataset updated
Dec 16, 2021
Authors
Subhajit Sahu
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
476 million Twitter tweets

Dataset information

476 million Twitter posts from 20 million users covering a 7 month period
from June 1 2009 to December 31 2009. We estimate this is about 20-30% of
all public tweets published on Twitter during the particular time frame.

For each public tweet the following information was available:

Author Time Content

We have no Twitter social graph (who-follows-whom graph) available. You can find a copy of the graph at http://an.kaist.ac.kr/traces/WWW2010.html
(thanks to Haewoon Kwak, et al.).

Dataset statistics
Number of users 17,069,982
Number of tweets 476,553,560
Number of URLs 181,611,080
Number of Hashtags 49,293,684
Number of re-tweets 71,835,017

Source (citation)
J. Yang, J. Leskovec. Temporal Variation in Online Media. ACM Intl.
Conf. on Web Search and Data Mining (WSDM '11), 2011.

As per request from Twitter the data is no longer available.

http://an.kaist.ac.kr/traces/WWW2010.html :

What is Twitter, a Social Network or a News Media?

Haewoon Kwak (http://an.kaist.ac.kr/~haewoon),
Changhyun Lee (http://an.kaist.ac.kr/~chlee),
Hosung Park (http://an.kaist.ac.kr/~hosung),
and Sue Moon (http://an.kaist.ac.kr/~sbmoon)

Proceedings of the 19th International World Wide Web (WWW) Conference,
April 26-30, 2010, Raleigh NC (USA)

Twitter, a microblogging service less than three years old, commands more
than 41 million users as of July 2009 and is growing fast. Twitter users
tweet about any topic within the 140-character limit and follow others to
receive their tweets. The goal of this paper is to study the topological
characteristics of Twitter and its power as a new medium of information
sharing.

We have crawled the entire Twitter site and obtained 41.7 million user
profiles, 1.47 billion social relations, 4,262 trending topics, and 106
million tweets. In its follower-following topology analysis we have found a non-power-law follower distribution, a short effective diameter, and low
reciprocity, which all mark a deviation from known characteristics of human social networks [Newman03]. In order to identify influentials on
Twitter, we have ranked users by the number of followers and by PageRank
and found two rankings to be similar. Ranking by retweets differs from the previous two rankings, indicating a gap in influence inferred from the
number of followers and that from the popularity of one's tweets. We have
analyzed the tweets of top trending topics and reported on their temporal
behavior and user participation. We have classified the trending topics
based on the active period and th...
122 CAIDA Autonomous systems Graphs (SNAP)
kaggle.com
zip
Updated Dec 16, 2021
Subhajit Sahu (2021). 122 CAIDA Autonomous systems Graphs (SNAP) [Dataset]. https://www.kaggle.com/datasets/wolfram77/graphs-snap-as-caida
Explore at:
Available download formats: zip (40197223 bytes)
Dataset updated
Dec 16, 2021
Authors
Subhajit Sahu
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The dataset contains 122 CAIDA AS graphs, from January 2004 to November 2007 - http://www.caida.org/data/active/as-relationships/ . Each file contains a full AS graph derived from a set of RouteViews BGP table snapshots.

Dataset statistics are calculated for the graph with the highest number of
nodes, the dataset from November 5, 2007.

Nodes 26475
Edges 106762
Nodes in largest WCC 26475 (1.000)
Edges in largest WCC 106762 (1.000)
Nodes in largest SCC 26475 (1.000)
Edges in largest SCC 106762 (1.000)
Average clustering coefficient 0.2082
Number of triangles 36365
Fraction of closed triangles 0.007319
Diameter (longest shortest path) 17
90-percentile effective diameter 4.6

Source (citation)

J. Leskovec, J. Kleinberg and C. Faloutsos. Graphs over Time: Densification
Laws, Shrinking Diameters and Possible Explanations. ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining (KDD), 2005.

Files
File Description
as-caida20071105.txt.gz CAIDA AS graph from November 5 2007
as-caida.tar.gz 122 CAIDA AS graphs from January 2004 to November 2007

NOTE for UF Sparse Matrix Collection: these graphs are weighted. In the
original SNAP data set, the edge weights are in the set {-1, 0, 1, 2}. Note
that "0" is an edge weight. This can be handled in the UF collection for the
primary sparse matrix in a Problem, but not when the matrices are in a sequence in the Problem.aux MATLAB struct. The entries with zero edge weight would
become lost. To correct for this, the weights are modified by adding 2 to each weight. This preserves the structure of the original graphs, so that edges
with weight zero are not lost. (A non-edge is not the same as an edge with
weight zero in this problem).

old weight   new weight
-1           1
 0           2
 1           3
 2           4

So to obtain the original weights, subtract 2 from each entry.
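The weight shift is simple enough to show directly. The sketch below is not taken from the collection's own code: the edge triples and the two helper functions are invented for illustration, and the only thing taken from the note is the offset of 2.

```python
WEIGHT_OFFSET = 2  # from the note: stored weight = original weight + 2


def to_original_weight(stored):
    """Recover the SNAP weight in {-1, 0, 1, 2} from a stored weight."""
    return stored - WEIGHT_OFFSET


def to_stored_weight(original):
    """Shift a SNAP weight so that no explicit edge has weight zero."""
    return original + WEIGHT_OFFSET


# Invented (row, column, stored weight) triples covering all four stored values.
stored_edges = [(1, 2, 1), (2, 3, 2), (3, 4, 3), (4, 5, 4)]

original_edges = [(u, v, to_original_weight(w)) for u, v, w in stored_edges]
print(original_edges)
```

Because every stored weight is at least 1, an edge whose original weight is 0 survives storage in a sparse matrix, where a zero entry would be indistinguishable from a missing edge.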

The primary sparse matrix for this problem is the as-caida20071105 matrix, or
Problem.aux.G{121}, the second-to-the-last graph in the sequence.

The nodes are uniform across all graphs in the sequence in the UF collection.
That is, nodes do not come and go. A node that is "gone" simply has no edges. This is to allow comparisons across each node in the graphs.
Problem.aux.nodenames gives the node numbers of the original problem. So
row/column i in the matrix is always node number Problem.aux.nodenames(i) in
all the graphs.

Problem.aux.G{k} is the kth graph in the sequence.
Problem.aux.Gname(k,:) is the name of the kth graph.
SNAP Memetracker
kaggle.com
zip
Updated Nov 21, 2016
Stanford Network Analysis Project (2016). SNAP Memetracker [Dataset]. https://www.kaggle.com/snap/snap-memetracker
Explore at:
Available download formats: zip (922206150 bytes)
Dataset updated
Nov 21, 2016
Dataset authored and provided by
Stanford Network Analysis Project
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
This database contains a subset of the Memetracker dataset collected by SNAP.

The full Memetracker dataset has observations broken into months. Because of size considerations, however, this version consists of one-half of a month: the first 15 days of Memetracker observations from November 2008.

About

Memetracker tracks the quotes and phrases that appear most frequently over time across the entire online news spectrum. This makes it possible to see how different stories compete for news and blog coverage each day, and how certain stories persist while others fade quickly.

Overall, Memetracker tracks more than 17 million different phrases. About 54% of the total phrase/quote mentions appear on blogs and 46% in news media.

Acknowledgments

This dataset was collected by the Stanford Network Analysis Project. Detailed information about the data and its analysis can be found at the website here.

An analysis of this dataset was published here:
J. Leskovec, L. Backstrom, J. Kleinberg. Meme-tracking and the Dynamics of the News Cycle. ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining, 2009.

The Data

The SQLite database contains three tables:

articles: 4,542,920 records, with the following fields:

article_id: a unique id for the article (int)

url: the URL of the article (text)

date: the date of the article (text), in the strptime format '%Y-%m-%d %H:%M:%S'

quotes: 7,956,125 records, with the following fields:

article_id: unique id for the article that this quote was found in (int)

phrase: the high-frequency phrase found in the article (text)

links: 16,727,125 records, with the following fields:

article_id: unique id for the article that this link was found in (int)

link_out: the URL of the link out (text)

link_out_id: unique id for the target article (int), if it exists; else NULL
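To make the table layout concrete, the sketch below rebuilds the three tables in an in-memory SQLite database and runs two small queries with Python's standard `sqlite3` module. The table and column names follow the description above; the column types in the `CREATE TABLE` statements, the rows, URLs, and the phrase are assumptions made up for the example, and the real database file may declare its schema differently.

```python
import sqlite3
from datetime import datetime

DATE_FORMAT = "%Y-%m-%d %H:%M:%S"  # strptime format given for articles.date

# In-memory stand-in for the real database file, with invented rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE articles (article_id INTEGER, url TEXT, date TEXT);
    CREATE TABLE quotes (article_id INTEGER, phrase TEXT);
    CREATE TABLE links (article_id INTEGER, link_out TEXT, link_out_id INTEGER);
""")
conn.executemany("INSERT INTO articles VALUES (?, ?, ?)", [
    (1, "http://example.com/a", "2008-11-01 09:30:00"),
    (2, "http://example.com/b", "2008-11-02 18:00:00"),
])
conn.executemany("INSERT INTO quotes VALUES (?, ?)", [
    (1, "an invented phrase"),
    (2, "an invented phrase"),
])
conn.executemany("INSERT INTO links VALUES (?, ?, ?)", [
    (2, "http://example.com/a", 1),             # target article is in the table
    (2, "http://example.org/elsewhere", None),  # target article unknown: NULL
])

# When was a phrase first seen? Join quotes to articles on article_id.
rows = conn.execute("""
    SELECT a.date
    FROM quotes AS q
    JOIN articles AS a ON a.article_id = q.article_id
    WHERE q.phrase = ?
    ORDER BY a.date
""", ("an invented phrase",)).fetchall()
first_seen = datetime.strptime(rows[0][0], DATE_FORMAT)

# How many links point at an article that exists in the articles table?
internal_links = conn.execute(
    "SELECT COUNT(*) FROM links WHERE link_out_id IS NOT NULL"
).fetchone()[0]

print(first_seen, internal_links)
```

Sorting the text dates works here because the '%Y-%m-%d %H:%M:%S' layout orders lexicographically in the same way as chronologically.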
Subhajit Sahu (2021). Location Graphs (SNAP) [Dataset]. https://www.kaggle.com/wolfram77/graphs-snap-loc

Location Graphs (SNAP)

Graphs from the Stanford Network Analysis Platform

Explore at:

Available download formats: zip (163822208 bytes)

Dataset updated

Dec 16, 2021

Authors

Subhajit Sahu

License

Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

loc-Brightkite

https://snap.stanford.edu/data/loc-Brightkite.html

Dataset information

Brightkite (http://www.brightkite.com/) was once a location-based social
networking service provider where users shared their locations by
checking-in. The friendship network was collected using their public API, and consists of 58,228 nodes and 214,078 edges. The network is originally
directed but we have constructed a network with undirected edges when there is a friendship in both ways. We have also collected a total of 4,491,143
checkins of these users over the period of Apr. 2008 - Oct. 2010.

Dataset statistics
Nodes 58,228
Edges 214,078
Nodes in largest WCC 56739 (0.974)
Edges in largest WCC 212945 (0.995)
Nodes in largest SCC 56739 (0.974)
Edges in largest SCC 212945 (0.995)
Average clustering coefficient 0.1723
Number of triangles 494728
Fraction of closed triangles 0.03979
Diameter (longest shortest path) 16
90-percentile effective diameter 6
Checkins 4,491,143

Source (citation)
E. Cho, S. A. Myers, J. Leskovec. Friendship and Mobility: User Movement in
Location-Based Social Networks. ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining (KDD), 2011.
http://cs.stanford.edu/people/jure/pubs/mobile-kdd11.pdf

Files
File Description
loc-brightkite_edges.txt.gz Friendship network of Brightkite users
loc-brightkite_totalCheckins.txt.gz
Time and location information of check-ins made by users

Example of check-in information

[user] [check-in time]      [latitude] [longitude] [location id]
58186 2008-12-03T21:09:14Z 39.633321 -105.317215 ee8b88dea22411    
58186 2008-11-30T22:30:12Z 39.633321 -105.317215 ee8b88dea22411    
58186 2008-11-28T17:55:04Z -13.158333 -72.531389 e6e86be2a22411    
58186 2008-11-26T17:08:25Z 39.633321 -105.317215 ee8b88dea22411    
58187 2008-08-14T21:23:55Z 41.257924 -95.938081 4c2af967eb5df8    
58187 2008-08-14T07:09:38Z 41.257924 -95.938081 4c2af967eb5df8    
58187 2008-08-14T07:08:59Z 41.295474 -95.999814 f3bb9560a2532e    
58187 2008-08-14T06:54:21Z 41.295474 -95.999814 f3bb9560a2532e    
58188 2010-04-06T06:45:19Z 46.521389  14.854444 ddaa40aaa22411    
58188 2008-12-30T15:30:08Z 46.522621  14.849618 58e12bc0d67e11    
58189 2009-04-08T07:36:46Z 46.554722  15.646667 ddaf9c4ea22411    
58190 2009-04-08T07:01:28Z 46.421389  15.869722 dd793f96a22411
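A check-in line in the layout shown above could be parsed along these lines. The sketch assumes the five fields are separated by whitespace and that the trailing `Z` marks UTC; the `Checkin` type and `parse_checkin` function are hypothetical names introduced for the example.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class Checkin(NamedTuple):
    user: int
    time: datetime
    latitude: float
    longitude: float
    location_id: str


def parse_checkin(line):
    """Split one check-in line into typed fields (assumes exactly 5 fields)."""
    user, stamp, lat, lon, loc = line.split()
    when = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    return Checkin(int(user), when, float(lat), float(lon), loc)


# First row of the example table above.
sample = "58186 2008-12-03T21:09:14Z 39.633321 -105.317215 ee8b88dea22411"
record = parse_checkin(sample)
print(record)
```

The unpacking raises a `ValueError` on lines that do not have exactly five fields, so incomplete lines need to be filtered out before calling it.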

Notes on inclusion into the SuiteSparse Matrix Collection, July 2018:

The SNAP data set is 0-based, with nodes numbered 0 to 58,227.

In the SuiteSparse Matrix Collection the graph is converted to 1-based.
The Problem.A matrix is the undirected friendship network, where
A(1+i,1+j)=1 if person i and person j are friends in the SNAP data set.

There are 4,747,287 checkins in the loc-brightkite_totalCheckins.txt
file, but 6 lines are empty with a user id but no other data (those
are discarded here). In the SuiteSparse Matrix Collection, the checkin
data is held in 5 vectors of length 4,747,281. These are in the
Problem.aux component of the MATLAB struct. The kth entry of each of
these vectors holds the data in the kth line of the
loc-brightkite_totalCheckins.txt file (after deleting the 6 empty lines).

userid: the SNAP user id is an integer in the range 0 to 58,227. It  
  has been incremented by one, here, to reflect the corresponding  
  row and column of the Problem.A matrix. It contains 51,406    
  unique user id's.                         
checkin_time: a string of length 20                  
latitude: a double precision number                  
longitude: a double precision number                  
location_id: a string of length 61.
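The conversion described in these notes (drop lines that carry only a user id, then shift user ids by one) might look roughly like the following. This is a sketch under stated assumptions, not the collection's actual conversion script: the function name, the dictionary layout, and the sample lines are illustrative, and fields are assumed to be whitespace-separated.

```python
def to_one_based_columns(lines):
    """Build five parallel columns from check-in lines.

    Lines without all five fields (for example a bare user id) are skipped,
    and each 0-based SNAP user id is incremented by one.
    """
    columns = {
        "userid": [],
        "checkin_time": [],
        "latitude": [],
        "longitude": [],
        "location_id": [],
    }
    for line in lines:
        fields = line.split()
        if len(fields) < 5:
            continue  # incomplete line: user id but no other data
        columns["userid"].append(int(fields[0]) + 1)
        columns["checkin_time"].append(fields[1])
        columns["latitude"].append(float(fields[2]))
        columns["longitude"].append(float(fields[3]))
        columns["location_id"].append(fields[4])
    return columns


# Two rows from the example table plus one invented incomplete line.
sample_lines = [
    "58186 2008-12-03T21:09:14Z 39.633321 -105.317215 ee8b88dea22411",
    "58190",
    "58188 2010-04-06T06:45:19Z 46.521389 14.854444 ddaa40aaa22411",
]

aux = to_one_based_columns(sample_lines)
print(aux["userid"])
```

After the shift, a value in `userid` can be used directly as a 1-based row or column index into the friendship matrix.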

loc-Gowalla

https://snap.stanford.edu/data/loc-Gowalla.html

Dataset information

Gowalla (http://www.gowalla.com/) is a location-based social networking
website where users share their locations by checking-in. The friendship
network is undirected and was collected using their public API, and
consists of 196,591 nodes and 950,327 edges. We have collected a total of
6,442,890 check-ins of these users over the period of Feb. 2009 - Oct.
2010.

Dataset statistics
Nodes 196,591
Edges 950,327
Nodes in largest WCC 196591 (1.000)
Edges in largest WCC 950327 (1.000)
Nodes in largest SCC 196591 (1.000)
Edges in largest SCC 950327 (1.000)
Average clustering coefficient 0.2367
Number of triangles 2273138
Fraction of closed triangles 0.007952
Diameter (longest shortest path) 14
90-percentile effective diameter 5.7
Check-ins 6,442,890

Files
File Description
loc-gowalla_edges.txt.gz Friendship network of Gowalla users
loc-gowalla_totalCheckins.txt.gz Time and location information
of check-ins made by users

Example of check-in information

[user] [check-in time]   [latitude]  [longitude] [location id]  
196514 2010-07-24T13:45:06Z 53.3648119  -2.2723465833  145064   
196514 2010-07-24T13:44:58Z 53.360511233 -2.276369017  1275991   
196514 2010-07-24T13:44:46Z 53.3653895945 -2.2754087046  376497   
196514 2010-07-24T13:44:38Z 53.3663709833 -2.2700764333  98503    
196514 2010-07-24T13:44:26Z 53.3674087524 -2.2783813477  1043431   
196514 2010-07-24T13:44:08Z 53.3675663377 -2.278631763  881734   
196514 2010-07-24T13:43:18Z 53.3679640626 -2.2792943689  207763   
196514 2010-07-24T13:41:10Z 53.364905   -2.270824    1042822
