20 datasets found
  1. Location Graphs (SNAP)

    • kaggle.com
    zip
    Updated Dec 16, 2021
    Cite
    Subhajit Sahu (2021). Location Graphs (SNAP) [Dataset]. https://www.kaggle.com/wolfram77/graphs-snap-loc
    Available download formats: zip (163822208 bytes)
    Dataset updated
    Dec 16, 2021
    Authors
    Subhajit Sahu
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    loc-Brightkite

    https://snap.stanford.edu/data/loc-Brightkite.html

    Dataset information

    Brightkite (http://www.brightkite.com/) was once a location-based social
    networking service provider where users shared their locations by
    checking in. The friendship network was collected using their public API
    and consists of 58,228 nodes and 214,078 edges. The network was originally
    directed, but we constructed an undirected network that keeps an edge
    wherever the friendship holds in both directions. We also collected a
    total of 4,491,143 check-ins of these users over the period Apr. 2008 -
    Oct. 2010.

    Dataset statistics
    Nodes 58,228
    Edges 214,078
    Nodes in largest WCC 56739 (0.974)
    Edges in largest WCC 212945 (0.995)
    Nodes in largest SCC 56739 (0.974)
    Edges in largest SCC 212945 (0.995)
    Average clustering coefficient 0.1723
    Number of triangles 494728
    Fraction of closed triangles 0.03979
    Diameter (longest shortest path) 16
    90-percentile effective diameter 6
    Checkins 4,491,143
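    The "largest WCC" rows above report the share of nodes that sit in the biggest weakly connected component. A minimal sketch of how such a fraction could be computed with union-find is shown here; the function name and the toy graph are illustrative assumptions, not part of the SNAP tooling.

```python
from collections import Counter
from typing import Iterable, Tuple


def largest_wcc_fraction(num_nodes: int, edges: Iterable[Tuple[int, int]]) -> float:
    """Fraction of nodes in the largest weakly connected component.

    Nodes are assumed to be numbered 0 .. num_nodes - 1 and num_nodes > 0.
    Edge direction is ignored, which is what "weakly connected" means.
    """
    parent = list(range(num_nodes))

    def find(x: int) -> int:
        # Follow parent links to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in edges:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v  # merge the two components

    sizes = Counter(find(x) for x in range(num_nodes))
    return max(sizes.values()) / num_nodes


# Toy graph (made up): components {0, 1, 2} and {3, 4}.
toy_fraction = largest_wcc_fraction(5, [(0, 1), (1, 2), (3, 4)])
print(toy_fraction)

# The published Brightkite figure can be cross-checked by plain division.
print(round(56739 / 58228, 3))
```

    On the real edge list the same function would be fed the pairs read from loc-brightkite_edges.txt.gz.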

    Source (citation)
    E. Cho, S. A. Myers, J. Leskovec. Friendship and Mobility: User Movement
    in Location-Based Social Networks. ACM SIGKDD International Conference on
    Knowledge Discovery and Data Mining (KDD), 2011.
    http://cs.stanford.edu/people/jure/pubs/mobile-kdd11.pdf

    Files
    File                                 Description
    loc-brightkite_edges.txt.gz          Friendship network of Brightkite users
    loc-brightkite_totalCheckins.txt.gz  Time and location information of check-ins made by users
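    A gzip-compressed edge list like the one listed above can be read without unpacking it first. The sketch below assumes one whitespace-separated node pair per line and skips lines starting with "#"; both are assumptions about the file layout, and read_edges is a hypothetical helper name.

```python
import gzip
import io
from typing import BinaryIO, List, Tuple, Union


def read_edges(source: Union[str, BinaryIO]) -> List[Tuple[int, int]]:
    """Read (u, v) integer pairs from a gzip file path or binary stream."""
    edges = []
    with gzip.open(source, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # blank line or comment header (assumed convention)
            u, v = line.split()[:2]
            edges.append((int(u), int(v)))
    return edges


# Self-contained demo: compress a tiny made-up edge list in memory.
demo_bytes = gzip.compress(b"# demo header\n0\t1\n0\t2\n\n1\t2\n")
demo_edges = read_edges(io.BytesIO(demo_bytes))
print(demo_edges)
```

    With the real download, the call would be read_edges("loc-brightkite_edges.txt.gz").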

    Example of check-in information

    [user] [check-in time]   [latitude] [longitude] [location id]
    58186 2008-12-03T21:09:14Z 39.633321 -105.317215 ee8b88dea22411    
    58186 2008-11-30T22:30:12Z 39.633321 -105.317215 ee8b88dea22411    
    58186 2008-11-28T17:55:04Z -13.158333 -72.531389 e6e86be2a22411    
    58186 2008-11-26T17:08:25Z 39.633321 -105.317215 ee8b88dea22411    
    58187 2008-08-14T21:23:55Z 41.257924 -95.938081 4c2af967eb5df8    
    58187 2008-08-14T07:09:38Z 41.257924 -95.938081 4c2af967eb5df8    
    58187 2008-08-14T07:08:59Z 41.295474 -95.999814 f3bb9560a2532e    
    58187 2008-08-14T06:54:21Z 41.295474 -95.999814 f3bb9560a2532e    
    58188 2010-04-06T06:45:19Z 46.521389  14.854444 ddaa40aaa22411    
    58188 2008-12-30T15:30:08Z 46.522621  14.849618 58e12bc0d67e11    
    58189 2009-04-08T07:36:46Z 46.554722  15.646667 ddaf9c4ea22411    
    58190 2009-04-08T07:01:28Z 46.421389  15.869722 dd793f96a22411    
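    The five columns in the example above map naturally onto a small record type. The following is a minimal parsing sketch; the CheckIn class and parse_checkin function are hypothetical names, and the code assumes fields are separated by whitespace (spaces or tabs).

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class CheckIn:
    user: int
    time: datetime
    latitude: float
    longitude: float
    location_id: str


def parse_checkin(line: str) -> Optional[CheckIn]:
    """Parse one check-in line; return None when fields are missing."""
    parts = line.split()
    if len(parts) != 5:
        return None  # for example, a line that holds only a user id
    user, stamp, lat, lon, loc = parts
    # The sample timestamps end in a literal "Z", i.e. they are in UTC.
    when = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    return CheckIn(int(user), when, float(lat), float(lon), loc)


sample = "58186 2008-12-03T21:09:14Z 39.633321 -105.317215 ee8b88dea22411"
record = parse_checkin(sample)
print(record)
```

    Returning None for short lines keeps the caller in control of whether incomplete rows are skipped or reported.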
    

    Notes on inclusion into the SuiteSparse Matrix Collection, July 2018:

    The SNAP data set is 0-based, with nodes numbered 0 to 58,227.

    In the SuiteSparse Matrix Collection the graph is converted to 1-based.
    The Problem.A matrix is the undirected friendship network, where
    A(i,j)=1 if persons i-1 and j-1 are friends in the SNAP data set.

    There are 4,747,287 checkins in the loc-brightkite_totalCheckins.txt
    file, but 6 lines are empty with a user id but no other data (those
    are discarded here). In the SuiteSparse Matrix Collection, the checkin
    data is held in 5 vectors of length 4,747,281. These are in the
    Problem.aux component of the MATLAB struct. The kth entry of each of
    these vectors holds the data in the kth line of the
    loc-brightkite_totalCheckins.txt file (after deleting the 6 empty lines).

    userid: the SNAP user id is an integer in the range 0 to 58,227. It  
      has been incremented by one, here, to reflect the corresponding  
      row and column of the Problem.A matrix. It contains 51,406    
      unique user id's.                         
    checkin_time: a string of length 20                  
    latitude: a double precision number                  
    longitude: a double precision number                  
    location_id: a string of length 61.
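    The conversion described in these notes (drop lines that carry only a user id, shift user ids from 0-based to 1-based, and split the rest into five parallel vectors) can be sketched as follows. This is an illustration in Python under those stated rules, not the MATLAB code used for the SuiteSparse Matrix Collection; to_aux_vectors is a hypothetical name and the demo lines are taken from the example above plus one made-up incomplete line.

```python
from typing import Dict, Iterable, List


def to_aux_vectors(lines: Iterable[str]) -> Dict[str, List]:
    """Build five parallel vectors from raw check-in lines."""
    aux: Dict[str, List] = {
        "userid": [],
        "checkin_time": [],
        "latitude": [],
        "longitude": [],
        "location_id": [],
    }
    for line in lines:
        parts = line.split()
        if len(parts) != 5:
            continue  # discard lines with a user id but no other data
        aux["userid"].append(int(parts[0]) + 1)  # 0-based SNAP id -> 1-based
        aux["checkin_time"].append(parts[1])
        aux["latitude"].append(float(parts[2]))
        aux["longitude"].append(float(parts[3]))
        aux["location_id"].append(parts[4])
    return aux


demo_lines = [
    "58186 2008-12-03T21:09:14Z 39.633321 -105.317215 ee8b88dea22411",
    "58187",  # incomplete line, dropped
    "58188 2010-04-06T06:45:19Z 46.521389 14.854444 ddaa40aaa22411",
]
aux = to_aux_vectors(demo_lines)
print(aux["userid"], len(aux["checkin_time"]))
```

    All five vectors always have the same length, so the kth entries together describe the kth retained line.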
    

    loc-Gowalla

    https://snap.stanford.edu/data/loc-Gowalla.html

    Dataset information

    Gowalla (http://www.gowalla.com/) is a location-based social networking
    website where users share their locations by checking-in. The friendship
    network is undirected and was collected using their public API, and
    consists of 196,591 nodes and 950,327 edges. We have collected a total of
    6,442,890 check-ins of these users over the period of Feb. 2009 - Oct.
    2010.

    Dataset statistics
    Nodes 196,591
    Edges 950,327
    Nodes in largest WCC 196591 (1.000)
    Edges in largest WCC 950327 (1.000)
    Nodes in largest SCC 196591 (1.000)
    Edges in largest SCC 950327 (1.000)
    Average clustering coefficient 0.2367
    Number of triangles 2273138
    Fraction of closed triangles 0.007952
    Diameter (longest shortest path) 14
    90-percentile effective diameter 5.7
    Check-ins 6,442,890

    Source (citation)
    E. Cho, S. A. Myers, J. Leskovec. Friendship and Mobility: User Movement
    in Location-Based Social Networks. ACM SIGKDD International Conference on
    Knowledge Discovery and Data Mining (KDD), 2011.
    http://cs.stanford.edu/people/jure/pubs/mobile-kdd11.pdf

    Files
    File                              Description
    loc-gowalla_edges.txt.gz          Friendship network of Gowalla users
    loc-gowalla_totalCheckins.txt.gz  Time and location information of check-ins made by users

    Example of check-in information

    [user] [check-in time]   [latitude]  [longitude] [location id]  
    196514 2010-07-24T13:45:06Z 53.3648119  -2.2723465833  145064   
    196514 2010-07-24T13:44:58Z 53.360511233 -2.276369017  1275991   
    196514 2010-07-24T13:44:46Z 53.3653895945 -2.2754087046  376497   
    196514 2010-07-24T13:44:38Z 53.3663709833 -2.2700764333  98503    
    196514 2010-07-24T13:44:26Z 53.3674087524 -2.2783813477  1043431   
    196514 2010-07-24T13:44:08Z 53.3675663377 -2.278631763  881734   
    196514 2010-07-24T13:43:18Z 53.3679640626 -2.2792943689  207763   
    196514 2010-07-24T13:41:10Z 53.364905   -2.270824    1042822
    
  2. Autonomous System Graphs (SNAP)

    • kaggle.com
    zip
    Updated Dec 16, 2021
    Cite
    Subhajit Sahu (2021). Autonomous System Graphs (SNAP) [Dataset]. https://www.kaggle.com/wolfram77/graphs-snap-as
    Available download formats: zip (94677378 bytes)
    Dataset updated
    Dec 16, 2021
    Authors
    Subhajit Sahu
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Autonomous systems - Oregon-1

    Dataset information

    9 graphs of Autonomous Systems (AS) peering information inferred from Oregon
    route-views between March 31 2001 and May 26 2001.

    Dataset statistics are calculated for the graphs with the lowest (March 31 2001) and highest (May 26 2001) number of nodes.

    Dataset statistics for graph with lowest number of nodes - March 31 2001

    Nodes 10670
    Edges 22002
    Nodes in largest WCC 10670 (1.000)
    Edges in largest WCC 22002 (1.000)
    Nodes in largest SCC 10670 (1.000)
    Edges in largest SCC 22002 (1.000)
    Average clustering coefficient 0.4559
    Number of triangles 17144
    Fraction of closed triangles 0.009306
    Diameter (longest shortest path) 9
    90-percentile effective diameter 4.5

    Dataset statistics for graph with highest number of nodes - May 26 2001

    Nodes 11174
    Edges 23409
    Nodes in largest WCC 11174 (1.000)
    Edges in largest WCC 23409 (1.000)
    Nodes in largest SCC 11174 (1.000)
    Edges in largest SCC 23409 (1.000)
    Average clustering coefficient 0.4532
    Number of triangles 19894
    Fraction of closed triangles 0.009636
    Diameter (longest shortest path) 10
    90-percentile effective diameter 4.4
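    One quick way to compare the two snapshots is the average degree, 2E/N for an undirected graph. The sketch below applies that formula to the node and edge counts listed above; treating the edge counts as undirected edges is an assumption, and average_degree is a hypothetical helper.

```python
def average_degree(nodes: int, edges: int) -> float:
    """Mean degree of an undirected graph: every edge touches two endpoints."""
    if nodes <= 0:
        raise ValueError("nodes must be positive")
    return 2 * edges / nodes


# (nodes, edges) copied from the statistics for the first and last snapshot.
snapshots = {
    "oregon1_010331": (10670, 22002),
    "oregon1_010526": (11174, 23409),
}
for name, (n, m) in snapshots.items():
    print(f"{name}: average degree {average_degree(n, m):.3f}")
```

    A rising average degree between the first and last snapshot means edges grew faster than nodes over the period.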

    Source (citation)

    J. Leskovec, J. Kleinberg and C. Faloutsos. Graphs over Time: Densification
    Laws, Shrinking Diameters and Possible Explanations. ACM SIGKDD International
    Conference on Knowledge Discovery and Data Mining (KDD), 2005.

    Files
    File Description
    * AS peering information inferred from Oregon route-views ...
    oregon1_010331.txt.gz from March 31 2001
    oregon1_010407.txt.gz from April 7 2001
    oregon1_010414.txt.gz from April 14 2001
    oregon1_010421.txt.gz from April 21 2001
    oregon1_010428.txt.gz from April 28 2001
    oregon1_010505.txt.gz from May 05 2001
    oregon1_010512.txt.gz from May 12 2001
    oregon1_010519.txt.gz from May 19 2001
    oregon1_010526.txt.gz from May 26 2001

    NOTE: for the UF Sparse Matrix Collection, the primary matrix in this problem
    set (Problem.A) is the last matrix in the sequence, oregon1_010526, from May 26 2001.

    The nodes are uniform across all graphs in the sequence in the UF collection.
    That is, nodes do...

  3. Electronic Medical Record Data-Mining

    • simtk.org
    Updated Sep 26, 2017
    Cite
    Jonathan Chen (2017). Electronic Medical Record Data-Mining [Dataset]. https://simtk.org/frs/?group_id=892
    Available download formats: data/images/video (5 MB), application/x-zip-compressed (1 MB), source code (1 MB)
    Dataset updated
    Sep 26, 2017
    Dataset provided by
    Stanford
    Authors
    Jonathan Chen
    Description

    EMR data-mining code, such as association rules for order recommendations, outcome predictions, and order set evaluation.



    This project includes the following software/data packages:

    • Order Sets and Topic Models : Application code and support script to reproduce topic model and order set prediction evaluations as published in JAMIA 2016 manuscript.
    • ICU DNR : Data underlying paper: "Reversals and Limitations of High-Intensity, Life-Sustaining Treatments" regarding clinical factors associated with DNR and Comfort Care orders in the ICU
    • Item Association Code PSB 2016

  4. Bitcoin Trust Weighted Signed Networks (SNAP)

    • kaggle.com
    zip
    Updated Jan 2, 2022
    Cite
    Subhajit Sahu (2022). Bitcoin Trust Weighted Signed Networks (SNAP) [Dataset]. https://www.kaggle.com/wolfram77/graphs-snap-soc-sign-bitcoin
    Available download formats: zip (2209890 bytes)
    Dataset updated
    Jan 2, 2022
    Authors
    Subhajit Sahu
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Bitcoin Alpha trust weighted signed network

    https://snap.stanford.edu/data/soc-sign-bitcoin-alpha.html

    Dataset information

    This is a who-trusts-whom network of people who trade using Bitcoin on a
    platform called Bitcoin Alpha (http://www.btcalpha.com/). Since Bitcoin
    users are anonymous, there is a need to maintain a record of users'
    reputation to prevent transactions with fraudulent and risky users.
    Members of Bitcoin Alpha rate other members on a scale of -10 (total
    distrust) to +10 (total trust) in steps of 1. This is the first explicit
    weighted signed directed network available for research.

    Dataset statistics
    Nodes 3,783
    Edges 24,186
    Range of edge weight -10 to +10
    Percentage of positive edges 93%

    Similar network from another Bitcoin platform, Bitcoin OTC, is available at https://snap.stanford.edu/data/soc-sign-bitcoinotc.html (and as
    SNAP/bitcoin-otc in the SuiteSparse Matrix Collection).

    Source (citation)
    Please cite the following paper if you use this dataset:
    S. Kumar, F. Spezzano, V.S. Subrahmanian, C. Faloutsos. Edge Weight
    Prediction in Weighted Signed Networks. IEEE International Conference on
    Data Mining (ICDM), 2016.
    http://cs.stanford.edu/~srijan/pubs/wsn-icdm16.pdf

    The following BibTeX citation can be used:
    @inproceedings{kumar2016edge,
    title={Edge weight prediction in weighted signed networks},
    author={Kumar, Srijan and Spezzano, Francesca and
    Subrahmanian, VS and Faloutsos, Christos},
    booktitle={Data Mining (ICDM), 2016 IEEE 16th Intl. Conf. on},
    pages={221--230},
    year={2016},
    organization={IEEE}
    }

    The project webpage for this paper, along with its code to calculate two
    signed network metrics---fairness and goodness---is available at
    http://cs.umd.edu/~srijan/wsn/

    Files
    File                          Description
    soc-sign-bitcoinalpha.csv.gz  Weighted Signed Directed Bitcoin Alpha web of trust network

    Data format
    Each line has one rating with the following format:

    SOURCE, TARGET, RATING, TIME                      
    

    where

    SOURCE: node id of source, i.e., rater                 
    TARGET: node id of target, i.e., ratee                 
    RATING: the source's rating for the target,              
        ranging from -10 to +10 in steps of 1             
    TIME: the time of the rating, measured as seconds since Epoch.     
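    The four-field format above is plain CSV, so the standard csv module is enough to load it. The sketch below parses ratings, enforces the -10 to +10 range, and computes the share of positive edges; the Rating type, the function names, and the four sample rows are invented for illustration and are not taken from the dataset.

```python
import csv
import io
from typing import List, NamedTuple


class Rating(NamedTuple):
    source: int   # rater
    target: int   # ratee
    rating: int   # -10 .. +10
    time: float   # seconds since Epoch


def parse_ratings(text: str) -> List[Rating]:
    """Parse SOURCE,TARGET,RATING,TIME rows, validating the rating range."""
    ratings = []
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # skip blank lines
        src, tgt, value, stamp = (field.strip() for field in row)
        score = int(value)
        if not -10 <= score <= 10:
            raise ValueError(f"rating out of range: {score}")
        ratings.append(Rating(int(src), int(tgt), score, float(stamp)))
    return ratings


def positive_share(ratings: List[Rating]) -> float:
    """Fraction of edges whose rating is strictly positive."""
    return sum(r.rating > 0 for r in ratings) / len(ratings)


sample_csv = "1,2,4,1289241911\n2,1,-10,1289242000\n3,1,10,1289243000\n1,3,1,1289244000\n"
sample_ratings = parse_ratings(sample_csv)
print(positive_share(sample_ratings))
```

    Run on the full file, positive_share would give the figure reported as "Percentage of positive edges" in the statistics.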
    

    Notes on inclusion into the Suite...

  5. Anomaly Detection with Text Mining

    • catalog.data.gov
    • s.cnmilf.com
    • +2more
    Updated Apr 11, 2025
    Cite
    Dashlink (2025). Anomaly Detection with Text Mining [Dataset]. https://catalog.data.gov/dataset/anomaly-detection-with-text-mining
    Dataset updated
    Apr 11, 2025
    Dataset provided by
    Dashlink
    Description

    Many existing complex space systems have a significant amount of historical maintenance and problem data bases that are stored in unstructured text forms. The problem that we address in this paper is the discovery of recurring anomalies and relationships between problem reports that may indicate larger systemic problems. We will illustrate our techniques on data from discrepancy reports regarding software anomalies in the Space Shuttle. These free text reports are written by a number of different people, thus the emphasis and wording vary considerably. With Mehran Sahami from Stanford University, I'm putting together a book on text mining called "Text Mining: Theory and Applications" to be published by Taylor and Francis.

  6. North American Indian Drama, 2nd Edition [full text data]

    • stanford.redivis.com
    • redivis.com
    application/jsonl +7
    Updated Feb 18, 2023
    Cite
    Stanford University Libraries (2023). North American Indian Drama, 2nd Edition [full text data] [Dataset]. http://doi.org/10.57761/wqj6-jm04
    Available download formats: arrow, parquet, application/jsonl, sas, csv, stata, avro, spss
    Dataset updated
    Feb 18, 2023
    Dataset provided by
    Redivis Inc.
    Authors
    Stanford University Libraries
    Time period covered
    Oct 27, 2022 - Dec 16, 2022
    Description

    Abstract

    This collection includes the text and image data underlying Alexander Street Press' North American Indian Drama, 2nd Edition. The collection contains 244 plays by American Indian, First Nation, and Pacific Islander playwrights of the 20th century, as well as issues of the Native Playwrights' Newsletter. The collection represents groups across the United States and Canada, including Cherokee, Métis, Creek, Choctaw, Pembina Chippewa, Ojibway, Lenape, Comanche, Cree, Navajo, Rappahannock, Hawaiian/Samoan, and others.

    Usage

    For a complete list of titles, please see INDR-2E_metadata_FINAL.xlsx (under Supporting files).

    This deposit is Stanford Libraries’ local copy of North American Indian Drama, 2nd Edition, which may be used for data mining. If your research requires you to read the text to understand or analyze it, you may use the corresponding ProQuest database, available under Links.

  7. Dataset for Conflicting Statements Detection in Text

    • figshare.com
    zip
    Updated Feb 9, 2018
    Cite
    Vijay Lingam; Simran Bhuria; Mayukh Nair; Divij Gurpreetsingh; Anjali Goyal; Ashish Sureka (2018). Dataset for Conflicting Statements Detection in Text [Dataset]. http://doi.org/10.6084/m9.figshare.5873823.v1
    Available download formats: zip
    Dataset updated
    Feb 9, 2018
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Vijay Lingam; Simran Bhuria; Mayukh Nair; Divij Gurpreetsingh; Anjali Goyal; Ashish Sureka
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The files are from three different datasets. The first (SemEval) was downloaded from SemEval-2014, an international workshop on semantic evaluation held in Dublin (Ireland). The second (Stanford) is the same dataset used by Marneffe et al. for their work on finding contradictions in text. The third is PHEME RTE (Recognizing Textual Entailment). The attached dataset is annotated with four different types of contradictions. It also includes intermediate results and feature values from our work on conflicting statements detection in text.

  8. Stack Exchange Graphs (SNAP)

    • kaggle.com
    zip
    Updated Dec 16, 2021
    Cite
    Subhajit Sahu (2021). Stack Exchange Graphs (SNAP) [Dataset]. https://www.kaggle.com/datasets/wolfram77/graphs-snap-sx
    Available download formats: zip (1480133729 bytes)
    Dataset updated
    Dec 16, 2021
    Authors
    Subhajit Sahu
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Ask Ubuntu temporal network

    https://snap.stanford.edu/data/sx-askubuntu.html

    Dataset information

    This is a temporal network of interactions on the stack exchange web site
    Ask Ubuntu (http://askubuntu.com/). There are three different types of
    interactions represented by a directed edge (u, v, t):

    • user u answered user v's question at time t (in the graph sx-askubuntu-a2q)
    • user u commented on user v's question at time t (in the graph sx-askubuntu-c2q)
    • user u commented on user v's answer at time t (in the graph sx-askubuntu-c2a)

    The graph sx-askubuntu contains the union of these graphs. These graphs
    were constructed from the Stack Exchange Data Dump. Node ID numbers
    correspond to the 'OwnerUserId' tag in that data dump.

    Dataset statistics (sx-askubuntu)
    Nodes 159,316
    Temporal Edges 964,437
    Edges in static graph 596,933
    Time span 2613 days

    Dataset statistics (sx-askubuntu-a2q)
    Nodes 137,517
    Temporal Edges 280,102
    Edges in static graph 262,106
    Time span 2613 days

    Dataset statistics (sx-askubuntu-c2q)
    Nodes 79,155
    Temporal Edges 327,513
    Edges in static graph 198,852
    Time span 2047 days

    Dataset statistics (sx-askubuntu-c2a)
    Nodes 75,555
    Temporal Edges 356,822
    Edges in static graph 178,210
    Time span 2418 days

    Source (citation)
    Ashwin Paranjape, Austin R. Benson, and Jure Leskovec. "Motifs in Temporal Networks." In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, 2017.

    Files
    File Description
    sx-askubuntu.txt.gz All interactions
    sx-askubuntu-a2q.txt.gz Answers to questions
    sx-askubuntu-c2q.txt.gz Comments to questions
    sx-askubuntu-c2a.txt.gz Comments to answers

    Data format

    SRC DST UNIXTS                             
    

    where edges are separated by a new line and

    SRC: id of the source node (a user)                  
    DST: id of the target node (a user)
    UNIXTS: Unix timestamp (seconds since the epoch)            
                   ...
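    The statistics tables for these graphs distinguish temporal edges (every timestamped interaction) from edges in the static graph (distinct node pairs). A minimal sketch of how those counts could be derived from lines in the three-column format above follows; it assumes static edges are distinct directed pairs and that the time span is the difference between the last and first timestamps in whole days, which may not match how SNAP computed its figures. The sample lines are made up.

```python
from typing import Dict, Iterable

SECONDS_PER_DAY = 86_400


def summarize_temporal_edges(lines: Iterable[str]) -> Dict[str, int]:
    """Count nodes, temporal edges, static edges, and the span in days."""
    nodes = set()
    static_edges = set()
    temporal_edges = 0
    first = last = None
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        src, dst, ts = (int(p) for p in parts)
        nodes.update((src, dst))
        static_edges.add((src, dst))  # repeated interactions collapse here
        temporal_edges += 1
        first = ts if first is None else min(first, ts)
        last = ts if last is None else max(last, ts)
    span_days = 0 if first is None else (last - first) // SECONDS_PER_DAY
    return {
        "nodes": len(nodes),
        "temporal_edges": temporal_edges,
        "static_edges": len(static_edges),
        "span_days": span_days,
    }


sample_lines = ["1 2 0", "1 2 86400", "2 3 172800", "2 1 200000"]
summary = summarize_temporal_edges(sample_lines)
print(summary)
```

    In the sample, the pair (1, 2) occurs twice, so it counts as two temporal edges but only one static edge.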
    
  9. 733 instances of Autonomous systems traffic (SNAP)

    • kaggle.com
    zip
    Updated Dec 16, 2021
    Cite
    Subhajit Sahu (2021). 733 instances of Autonomous systems traffic (SNAP) [Dataset]. https://www.kaggle.com/wolfram77/graphs-snap-as-735
    Available download formats: zip (19603389 bytes)
    Dataset updated
    Dec 16, 2021
    Authors
    Subhajit Sahu
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The graph of routers comprising the Internet can be organized into
    sub-graphs called Autonomous Systems (AS). Each AS exchanges traffic flows
    with some neighbors (peers). We can construct a communication network of
    who-talks-to-whom from the BGP (Border Gateway Protocol) logs.

    The data was collected from University of Oregon Route Views Project
    (http://www.routeviews.org/) - Online data and reports. The dataset contains
    735 daily instances which span an interval of 785 days from November 8 1997 to January 2 2000. In contrast to citation networks, where nodes and edges only
    get added (not deleted) over time, the AS dataset also exhibits both the
    addition and deletion of the nodes and edges over time.
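    Because edges can both appear and disappear between daily instances, a set difference between two snapshots is a simple way to see the churn. The sketch below treats edges as undirected pairs, which is an assumption; snapshot_diff and the two toy snapshots are illustrative only.

```python
from typing import Iterable, List, Set, Tuple

Edge = Tuple[int, int]


def normalize(edges: Iterable[Edge]) -> Set[Edge]:
    """Store each undirected edge once, smaller endpoint first."""
    return {(min(u, v), max(u, v)) for u, v in edges}


def snapshot_diff(old_edges: Iterable[Edge], new_edges: Iterable[Edge]) -> Tuple[List[Edge], List[Edge]]:
    """Return (added, removed) edges when moving from old to new."""
    old, new = normalize(old_edges), normalize(new_edges)
    return sorted(new - old), sorted(old - new)


day_one = [(1, 2), (2, 3)]
day_two = [(2, 1), (3, 4)]  # (2, 1) is the same undirected edge as (1, 2)
added, removed = snapshot_diff(day_one, day_two)
print("added:", added, "removed:", removed)
```

    In a citation network the removed list would always be empty; here it generally is not.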

    Dataset statistics are calculated for the graph with the highest number of
    nodes and edges (dataset from January 02 2000):

    Dataset statistics
    Nodes 6474
    Edges 13233
    Nodes in largest WCC 6474 (1.000)
    Edges in largest WCC 13233 (1.000)
    Nodes in largest SCC 6474 (1.000)
    Edges in largest SCC 13233 (1.000)
    Average clustering coefficient 0.3913
    Number of triangles 6584
    Fraction of closed triangles 0.009591
    Diameter (longest shortest path) 9
    90-percentile effective diameter 4.6

    Source (citation)

    J. Leskovec, J. Kleinberg and C. Faloutsos. Graphs over Time: Densification
    Laws, Shrinking Diameters and Possible Explanations. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2005.

    Files
    File               Description
    as20000102.txt.gz  Autonomous Systems graph from January 02 2000
    as.tar.gz          735 Autonomous Systems graphs from November 8 1997 to January 02 2000

    NOTE: In the UF collection, the primary matrix (Problem.A) is the
    as20000102 matrix from January 02 2000 (the last graph in the sequence).

    The nodes are uniform across all graphs in the sequence in the UF collection. That is, nodes do not come and go. A node that is "gone" simply has no edges. This is to allow comparisons across each node in the graphs.
    Problem.aux.nodenames gives the node numbers of the original problem. So
    row/column i in the matrix is always node number Problem.aux.nodenames(i) in
    all the graphs.

    Problem.aux.G{k} is the kth graph in the sequence.
    Problem.aux.Gname(k,:) is the name of the kth graph.
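    The uniform-node convention described here (one shared node list, with every graph in the sequence indexed against it) can be sketched in Python as follows. This mirrors the idea behind Problem.aux.nodenames and Problem.aux.G{k} but is not the collection's MATLAB code; indices are 0-based here, the function name is hypothetical, and the node numbers in the demo are made up.

```python
from typing import Dict, List, Sequence, Set, Tuple

Edge = Tuple[int, int]


def align_snapshots(snapshots: Sequence[Sequence[Edge]]) -> Tuple[List[int], List[Set[Edge]]]:
    """Re-index every snapshot against one shared, sorted node list."""
    # Union of all original node numbers seen in any snapshot.
    nodenames = sorted({node for edges in snapshots for edge in edges for node in edge})
    index: Dict[int, int] = {name: i for i, name in enumerate(nodenames)}
    # Each graph uses the same index space; an absent node simply has no edges.
    graphs = [{(index[u], index[v]) for u, v in edges} for edges in snapshots]
    return nodenames, graphs


demo_snapshots = [
    [(701, 1239)],
    [(701, 3356), (1239, 3356)],
]
nodenames, graphs = align_snapshots(demo_snapshots)
print(nodenames)
print(graphs)
```

    Row or column i then refers to node nodenames[i] in every graph, which is what makes per-node comparisons across the sequence possible.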

  10. Counseling and Psychotherapy Transcripts: Volume II [full text data]

    • redivis.com
    application/jsonl +7
    Updated Feb 17, 2023
    Cite
    Stanford University Libraries (2023). Counseling and Psychotherapy Transcripts: Volume II [full text data] [Dataset]. http://doi.org/10.57761/zh3g-ch31
    Available download formats: application/jsonl, arrow, spss, avro, stata, csv, parquet, sas
    Dataset updated
    Feb 17, 2023
    Dataset provided by
    Redivis Inc.
    Authors
    Stanford University Libraries
    Time period covered
    Feb 13, 2023
    Description

    Abstract

    This collection contains the plain text transcripts of therapy sessions. These transcripts are sourced from Alexander Street Press' Counseling and Psychotherapy Transcripts: Volume II. The collection features a diverse set of clients, a wide range of presenting issues, and multiple therapeutic approaches. Content was recorded in 2012 or later, and the transcripts were generally released between 2013 and 2015. The collection adheres to the American Psychological Association's Ethics Guidelines for use and anonymity.

    Usage

    For a complete list of transcripts, please see CTRN Metadata_QA completed by WS 1.6.21.xlsx (under Supporting files).

    This deposit is Stanford Libraries’ local copy of Counseling and Psychotherapy Transcripts: Volume II, which may be used for data mining. If your research requires you to read the text to understand or analyze it, you may use the corresponding ProQuest database, available under Links.

  11. Mining TCGA Data Using Boolean Implications

    • plos.figshare.com
    tiff
    Updated Jun 4, 2023
    Cite
    Subarna Sinha; Emily K. Tsang; Haoyang Zeng; Michela Meister; David L. Dill (2023). Mining TCGA Data Using Boolean Implications [Dataset]. http://doi.org/10.1371/journal.pone.0102119
    Available download formats: tiff
    Dataset updated
    Jun 4, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Subarna Sinha; Emily K. Tsang; Haoyang Zeng; Michela Meister; David L. Dill
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Boolean implications (if-then rules) provide a conceptually simple, uniform and highly scalable way to find associations between pairs of random variables. In this paper, we propose to use Boolean implications to find relationships between variables of different data types (mutation, copy number alteration, DNA methylation and gene expression) from the glioblastoma (GBM) and ovarian serous cystadenoma (OV) data sets from The Cancer Genome Atlas (TCGA). We find hundreds of thousands of Boolean implications from these data sets. A direct comparison of the relationships found by Boolean implications and those found by commonly used methods for mining associations show that existing methods would miss relationships found by Boolean implications. Furthermore, many relationships exposed by Boolean implications reflect important aspects of cancer biology. Examples of our findings include cis relationships between copy number alteration, DNA methylation and expression of genes, a new hierarchy of mutations and recurrent copy number alterations, loss-of-heterozygosity of well-known tumor suppressors, and the hypermethylation phenotype associated with IDH1 mutations in GBM. The Boolean implication results used in the paper can be accessed at http://crookneck.stanford.edu/microarray/TCGANetworks/.
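    To make the idea of an if-then rule between two variables concrete, the sketch below checks "A high implies B high" on boolean vectors by counting samples that violate the rule. This is a deliberately simplified illustration, not the statistical test used in the paper; the 5% tolerance, the function name, and the sample vectors are all assumptions.

```python
from typing import Sequence


def holds_implication(a: Sequence[bool], b: Sequence[bool], max_violation_rate: float = 0.05) -> bool:
    """Check whether 'a is True => b is True' holds up to a violation tolerance."""
    if len(a) != len(b):
        raise ValueError("a and b must have the same length")
    support = sum(1 for x in a if x)  # samples where the 'if' part applies
    if support == 0:
        return False  # no evidence either way
    violations = sum(1 for x, y in zip(a, b) if x and not y)
    return violations / support <= max_violation_rate


# Made-up sample: whenever gene_a is high, gene_b is high, but not vice versa.
gene_a = [True, True, True, False, False]
gene_b = [True, True, True, True, False]
forward = holds_implication(gene_a, gene_b)
backward = holds_implication(gene_b, gene_a)
print(forward, backward)
```

    The asymmetry in the output is the point: an implication can hold in one direction only, which a symmetric correlation measure would not capture.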

  12. Data from: Academic offer of advanced digital technologies

    • data.europa.eu
    html, zip
    Updated Jun 7, 2023
    Cite
    Joint Research Centre (2023). Academic offer of advanced digital technologies [Dataset]. https://data.europa.eu/data/datasets/7aed1a89-c904-43ed-af0f-b024fc9cb92a?locale=bg
    Available download formats: zip, html
    Dataset updated
    Jun 7, 2023
    Dataset authored and provided by
    Joint Research Centre
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is the result of a project to support policy making by providing insights on the availability and composition of education offer in four key digital domains: artificial intelligence, high performance computing, cybersecurity, and data science. Following a text mining methodology that captures the inclusion of advanced digital technologies in the programmes’ syllabus, we monitor the availability of masters’ programmes, bachelor’s programmes and short professional courses and study their characteristics. These include the scope or depth with which the digital content is taught (classified into broad or specialised), education fields in which digital technologies are embedded (e.g., Information and communication technologies, Business, administration and law), and the content areas covered by the programmes (e.g. robotics, machine learning). Also, we consider the overlap between the four domains, to identify complementarities and synergies in the academic offer of advanced digital technologies. The dataset covers yearly data, starting from the academic year 2019-2020 and ending in academic year 2023-24 (and will not be further updated). In order to provide comparison with other competing economies, the dataset covers the EU and its Member States plus six additional countries: the United Kingdom, Norway, Switzerland, Canada, the United States, and Australia. Results of the study have been used as reference in the European Artificial Intelligence Strategy, the White Paper on Artificial Intelligence – a European approach to excellence and trust, in the Stanford University’s Artificial Intelligence Index Report 2019 and 2021. These data have substantiated the assessment of the national Recovery and Resilience plans, and are used as input for the Digital Resilience Dashboard, among others.

  13. Autonomous System Graphs by Skitter (SNAP)

    • kaggle.com
    zip
    Updated Dec 16, 2021
    Cite
    Subhajit Sahu (2021). Autonomous System Graphs by Skitter (SNAP) [Dataset]. https://www.kaggle.com/datasets/wolfram77/graphs-snap-as-skitter
    Available download formats: zip (33194649 bytes)
    Dataset updated
    Dec 16, 2021
    Authors
    Subhajit Sahu
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Autonomous systems by Skitter

    Dataset information

    Internet topology graph, built from traceroutes run daily in 2005
    (http://www.caida.org/tools/measurement/skitter) from several scattered
    sources to a million destinations. 1.7 million nodes, 11 million edges.

    Dataset statistics
    Nodes 1696415
    Edges 11095298
    Nodes in largest WCC 1694616 (0.999)
    Edges in largest WCC 11094209 (1.000)
    Nodes in largest SCC 1694616 (0.999)
    Edges in largest SCC 11094209 (1.000)
    Average clustering coefficient 0.2963
    Number of triangles 28769868
    Fraction of closed triangles 0.005387
    Diameter (longest shortest path) 25
    90-percentile effective diameter 5.9

    Source (citation)

    J. Leskovec, J. Kleinberg and C. Faloutsos. Graphs over Time: Densification
    Laws, Shrinking Diameters and Possible Explanations. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2005.

    Files
    File Description
    as-skitter.txt.gz AS from traceroutes run daily in 2005 by skitter

  14. Data from: A global network of biomedical relationships derived from text

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip
    Updated Jan 24, 2020
    Cite
    Bethany Percha; Russ B. Altman (2020). A global network of biomedical relationships derived from text [Dataset]. http://doi.org/10.5281/zenodo.3459420
    Explore at:
    Available download formats: application/gzip
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Bethany Percha; Russ B. Altman
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository contains labeled, weighted networks of chemical-gene, gene-gene, gene-disease, and chemical-disease relationships based on single sentences in PubMed abstracts. All raw dependency paths are provided in addition to the labeled relationships.

    PART I: Connects dependency paths to labels, or "themes". Each record contains a dependency path followed by its score for each theme, and indicators of whether or not the path is part of the flagship path set for each theme (meaning that it was manually reviewed and determined to reflect that theme). The themes themselves are listed below and are in our paper (reference below).

    PART II: Connects sentences to dependency paths. It consists of sentences and associated metadata, entity pairs found in the sentences, and dependency paths connecting those entity pairs. Each record contains the following information:

    • PubMed ID
    • Sentence number (0 = title)
    • First entity name, formatted
    • First entity name, location (characters from start of abstract)
    • Second entity name, formatted
    • Second entity name, location
    • First entity name, raw string
    • Second entity name, raw string
    • First entity name, database ID(s)
    • Second entity name, database ID(s)
    • First entity type (Chemical, Gene, Disease)
    • Second entity type (Chemical, Gene, Disease)
    • Dependency path
    • Sentence, tokenized

    The "with-themes.txt" files only contain dependency paths with corresponding theme assignments from Part I. The plain ".txt" files contain all dependency paths.

    This release contains the annotated network for the September 15, 2019 version of PubTator. The version discussed in our paper, below, is an older one - from April 30, 2016. If you're interested in that network, it can be found in Version 1 of this repository. We will be releasing updated networks periodically, as the PubTator community continues to release new versions of named entity annotations for Medline each month or so.

    ------------------------------------------------------------------------------------
    REFERENCES

    Percha B, Altman RBA (2017) A global network of biomedical relationships derived from text. Bioinformatics, 34(15): 2614-2624.
    Percha B, Altman RBA (2015) Learning the structure of biomedical relationships from unstructured text. PLoS Computational Biology, 11(7): e1004216.

    This project depends on named entity annotations from the PubTator project:
    https://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/PubTator/

    Reference:
    Wei CH et al., PubTator: a Web-based text mining tool for assisting Biocuration, Nucleic acids research, 2013, 41 (W1): W518-W522.

    Dependency parsing was provided by the Stanford CoreNLP toolkit (version 3.9.1):
    https://stanfordnlp.github.io/CoreNLP/index.html

    Reference:
    Manning, Christopher D., Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP Natural Language Processing Toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 55-60.

    ------------------------------------------------------------------------------------
    THEMES

    chemical-gene
    (A+) agonism, activation
    (A-) antagonism, blocking
    (B) binding, ligand (esp. receptors)
    (E+) increases expression/production
    (E-) decreases expression/production
    (E) affects expression/production (neutral)
    (N) inhibits

    gene-chemical
    (O) transport, channels
    (K) metabolism, pharmacokinetics
    (Z) enzyme activity

    chemical-disease
    (T) treatment/therapy (including investigatory)
    (C) inhibits cell growth (esp. cancers)
    (Sa) side effect/adverse event
    (Pr) prevents, suppresses
    (Pa) alleviates, reduces
    (J) role in disease pathogenesis

    disease-chemical
    (Mp) biomarkers (of disease progression)

    gene-disease
    (U) causal mutations
    (Ud) mutations affecting disease course
    (D) drug targets
    (J) role in pathogenesis
    (Te) possible therapeutic effect
    (Y) polymorphisms alter risk
    (G) promotes progression

    disease-gene
    (Md) biomarkers (diagnostic)
    (X) overexpression in disease
    (L) improper regulation linked to disease

    gene-gene
    (B) binding, ligand (esp. receptors)
    (W) enhances response
    (V+) activates, stimulates
    (E+) increases expression/production
    (E) affects expression/production (neutral)
    (I) signaling pathway
    (H) same protein or complex
    (Rg) regulation
    (Q) production by cell population

    ------------------------------------------------------------------------------------
    FORMATTING NOTE

    A few users have mentioned that the dependency paths in the "part-i" files are all lowercase text, whereas those in the "part-ii" files maintain the case of the original sentence. This complicates mapping between the two sets of files.

    We kept the part-ii files in the same case as the original sentence to facilitate downstream debugging - it's easier to tell which words in a particular sentence are contributing to the dependency path if their original case is maintained. When working with the part-ii "with-themes" files, if you simply convert the dependency path to lowercase, it is guaranteed to match to one of the paths in the corresponding part-i file and you'll be able to get the theme scores.
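
    The lowercase lookup described above can be sketched as follows. This is a minimal illustration, not the authors' code: the record layout (a path paired with a dictionary of theme scores) and the sample path string are made up for the example, and reading the real part-i and part-ii files is left out.

```python
def index_part_i(records):
    """Map each (already lowercase) part-i dependency path to its theme scores."""
    return {path: scores for path, scores in records}


def theme_scores_for(part_ii_path, part_i_index):
    """Lowercase a part-ii path before the lookup; return None if it has no themes."""
    return part_i_index.get(part_ii_path.lower())


# Hypothetical part-i record: (dependency path, {theme code: score}).
part_i_records = [
    ("start_entity|nsubj|inhibits inhibits|dobj|end_entity", {"N": 12.0, "B": 0.5}),
]
part_i_index = index_part_i(part_i_records)

# A part-ii path keeps the case of the original sentence.
scores = theme_scores_for(
    "START_ENTITY|nsubj|Inhibits Inhibits|dobj|END_ENTITY", part_i_index
)
```

    Building the dictionary once keeps each part-ii lookup constant-time, which matters when the part-ii files are large.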

    Apologies for the additional complexity, and please reach out to us if you have any questions (see correspondence information in the Bioinformatics manuscript, above).

  15. Marine Jurisdictions, Northeast United States, 2010

    • searchworks.stanford.edu
    zip
    Updated Nov 9, 2021
    Cite
    (2021). Marine Jurisdictions, Northeast United States, 2010 [Dataset]. https://searchworks.stanford.edu/view/zk816hk1203
    Explore at:
    Available download formats: zip
    Dataset updated
    Nov 9, 2021
    Area covered
    Northeastern United States, United States
    Description

    A standardized compilation of published marine boundaries from the NOAA Office of Coast Survey and the Minerals Management Service. Authoritative marine boundary data and information are available from the Agency of Responsibility, although often only at a regional scale and in a number of differing and disparate formats. The NOAA Coastal Services Center, working in conjunction with the Agency of Responsibility through the FGDC Marine Boundary Working Group, has compiled and standardized several datasets to create a national-scale, standardized data set based on several OGC and FGDC standards. Such datasets help to alleviate the initial cost of data mining, collecting, and the processing necessary to utilize such datasets for a variety of uses. Not for use for scales greater than 1:25,000.

  16. Citation Networks (SNAP)

    • kaggle.com
    zip
    Updated Dec 16, 2021
    Cite
    Subhajit Sahu (2021). Citation Networks (SNAP) [Dataset]. https://www.kaggle.com/datasets/wolfram77/graphs-snap-cit
    Explore at:
    Available download formats: zip (95620457 bytes)
    Dataset updated
    Dec 16, 2021
    Authors
    Subhajit Sahu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    High-energy physics citation network

    Dataset information

    Arxiv HEP-PH (high energy physics phenomenology) citation graph is from the
    e-print arXiv and covers all the citations within a dataset of 34,546 papers
    with 421,578 edges. If a paper i cites paper j, the graph contains a directed
    edge from i to j. If a paper cites, or is cited by, a paper outside the
    dataset, the graph does not contain any information about this.
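
    As a rough illustration of the edge direction, the sketch below counts how often each paper is cited, that is, its in-degree. It assumes the edge list is whitespace-separated "citing cited" pairs with "#" comment lines, which is an assumption about the file layout rather than something stated here; the sample lines are invented.

```python
from collections import Counter


def citation_counts(lines):
    """Count incoming citations per paper from 'citing cited' edge lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        citing, cited = map(int, line.split()[:2])
        counts[cited] += 1  # edge i -> j means paper i cites paper j
    return counts


# Invented sample: papers 1 and 3 cite paper 2, and paper 2 cites paper 4.
sample_lines = ["# FromNodeId ToNodeId", "1 2", "3 2", "2 4"]
counts = citation_counts(sample_lines)
```

    Papers that are never cited inside the dataset simply do not appear as keys, and a Counter returns 0 for them.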

    The data covers papers in the period from January 1993 to April 2003 (124
    months). It begins within a few months of the inception of the arXiv, and thus
    represents essentially the complete history of its HEP-PH section.

    The data was originally released as a part of 2003 KDD Cup.

    Dataset statistics
    Nodes 34546
    Edges 421578
    Nodes in largest WCC 34401 (0.996)
    Edges in largest WCC 421485 (1.000)
    Nodes in largest SCC 12711 (0.368)
    Edges in largest SCC 139981 (0.332)
    Average clustering coefficient 0.2962
    Number of triangles 1276868
    Fraction of closed triangles 0.1457
    Diameter (longest shortest path) 12
    90-percentile effective diameter 5

    Source (citation)

    J. Leskovec, J. Kleinberg and C. Faloutsos. Graphs over Time: Densification
    Laws, Shrinking Diameters and Possible Explanations. ACM SIGKDD International
    Conference on Knowledge Discovery and Data Mining (KDD), 2005.

    J. Gehrke, P. Ginsparg, J. M. Kleinberg. Overview of the 2003 KDD Cup. SIGKDD
    Explorations 5(2): 149-151, 2003.

    Files
    File Description
    cit-HepPh.txt.gz Paper citation network of Arxiv High Energy Physics category
    cit-HepPh-dates.txt.gz Time of nodes (paper submission time to Arxiv)

    High-energy physics theory citation network

    Dataset information

    Arxiv HEP-TH (high energy physics theory) citation graph is from the e-print
    arXiv and covers all the citations within a dataset of 27,770 papers with
    352,807 edges. If a paper i cites paper j, the graph contains a directed edge
    from i to j. If a paper cites, or is cited by, a paper outside the dataset, the graph does not contain any information about this.

    The data covers papers in the period from January 1993 to April 2003 (124
    months). It begins within a few months of the inception of the arXiv, and thus represents essentially the complete history of its HEP-TH section.

    The data was originally released as a part of 2003 KDD Cup.

    Dataset statistics
    Nodes 27770
    Edges 352807
    Nodes in largest WCC 27400 (0.987) ...

  17. Gowalla Checkins

    • kaggle.com
    zip
    Updated Nov 15, 2017
    Cite
    bqlearner (2017). Gowalla Checkins [Dataset]. https://www.kaggle.com/bqlearner/gowalla-checkins
    Explore at:
    Available download formats: zip (105113346 bytes)
    Dataset updated
    Nov 15, 2017
    Authors
    bqlearner
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    Gowalla is a location-based social networking website where users share their locations by checking-in.

    Content

    Time and location information of check-ins made by users.

    Acknowledgements

    This data set is available from https://snap.stanford.edu/data/loc-gowalla.html

    E. Cho, S. A. Myers, J. Leskovec. Friendship and Mobility: User Movement in Location-Based Social Networks. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2011.

  18. Twitter Posts Network (SNAP)

    • kaggle.com
    zip
    Updated Dec 16, 2021
    Cite
    Subhajit Sahu (2021). Twitter Posts Network (SNAP) [Dataset]. https://www.kaggle.com/datasets/wolfram77/graphs-snap-twitter7/code
    Explore at:
    Available download formats: zip (6560110658 bytes)
    Dataset updated
    Dec 16, 2021
    Authors
    Subhajit Sahu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    476 million Twitter tweets

    Dataset information

    476 million Twitter posts from 20 million users covering a 7 month period
    from June 1 2009 to December 31 2009. We estimate this is about 20-30% of
    all public tweets published on Twitter during the particular time frame.

    For each public tweet the following information was available:

    • Author
    • Time
    • Content

    We have no Twitter social graph (who-follows-whom graph) available. You can find a copy of the graph at http://an.kaist.ac.kr/traces/WWW2010.html
    (thanks to Haewoon Kwak, et al.).

    Dataset statistics
    Number of users 17,069,982
    Number of tweets 476,553,560
    Number of URLs 181,611,080
    Number of Hashtags 49,293,684
    Number of re-tweets 71,835,017

    Source (citation)
    J. Yang, J. Leskovec. Temporal Variation in Online Media. ACM Intl.
    Conf. on Web Search and Data Mining (WSDM '11), 2011.

    As per request from Twitter the data is no longer available.

    http://an.kaist.ac.kr/traces/WWW2010.html :

    What is Twitter, a Social Network or a News Media?

    Haewoon Kwak (http://an.kaist.ac.kr/~haewoon),
    Changhyun Lee (http://an.kaist.ac.kr/~chlee),
    Hosung Park (http://an.kaist.ac.kr/~hosung),
    and Sue Moon (http://an.kaist.ac.kr/~sbmoon)

    Proceedings of the 19th International World Wide Web (WWW) Conference,
    April 26-30, 2010, Raleigh NC (USA)

    Twitter, a microblogging service less than three years old, commands more
    than 41 million users as of July 2009 and is growing fast. Twitter users
    tweet about any topic within the 140-character limit and follow others to
    receive their tweets. The goal of this paper is to study the topological
    characteristics of Twitter and its power as a new medium of information
    sharing.

    We have crawled the entire Twitter site and obtained 41.7 million user
    profiles, 1.47 billion social relations, 4,262 trending topics, and 106
    million tweets. In its follower-following topology analysis we have found a non-power-law follower distribution, a short effective diameter, and low
    reciprocity, which all mark a deviation from known characteristics of human social networks [Newman03]. In order to identify influentials on
    Twitter, we have ranked users by the number of followers and by PageRank
    and found two rankings to be similar. Ranking by retweets differs from the previous two rankings, indicating a gap in influence inferred from the
    number of followers and that from the popularity of one's tweets. We have
    analyzed the tweets of top trending topics and reported on their temporal
    behavior and user participation. We have classified the trending topics
    based on the active period and th...

  19. 122 CAIDA Autonomous systems Graphs (SNAP)

    • kaggle.com
    zip
    Updated Dec 16, 2021
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Subhajit Sahu (2021). 122 CAIDA Autonomous systems Graphs (SNAP) [Dataset]. https://www.kaggle.com/datasets/wolfram77/graphs-snap-as-caida
    Explore at:
    Available download formats: zip (40197223 bytes)
    Dataset updated
    Dec 16, 2021
    Authors
    Subhajit Sahu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset contains 122 CAIDA AS graphs, from January 2004 to November 2007 - http://www.caida.org/data/active/as-relationships/ . Each file contains a full AS graph derived from a set of RouteViews BGP table snapshots.

    Dataset statistics are calculated for the graph with the highest number of
    nodes, the dataset from November 5 2007.

    Nodes 26475
    Edges 106762
    Nodes in largest WCC 26475 (1.000)
    Edges in largest WCC 106762 (1.000)
    Nodes in largest SCC 26475 (1.000)
    Edges in largest SCC 106762 (1.000)
    Average clustering coefficient 0.2082
    Number of triangles 36365
    Fraction of closed triangles 0.007319
    Diameter (longest shortest path) 17
    90-percentile effective diameter 4.6

    Source (citation)

    J. Leskovec, J. Kleinberg and C. Faloutsos. Graphs over Time: Densification
    Laws, Shrinking Diameters and Possible Explanations. ACM SIGKDD International
    Conference on Knowledge Discovery and Data Mining (KDD), 2005.

    Files
    File Description
    as-caida20071105.txt.gz CAIDA AS graph from November 5 2007
    as-caida.tar.gz 122 CAIDA AS graphs from January 2004 to November 2007

    NOTE for UF Sparse Matrix Collection: these graphs are weighted. In the
    original SNAP data set, the edge weights are in the set {-1, 0, 1, 2}. Note
    that "0" is an edge weight. This can be handled in the UF collection for the
    primary sparse matrix in a Problem, but not when the matrices are in a sequence in the Problem.aux MATLAB struct. The entries with zero edge weight would
    become lost. To correct for this, the weights are modified by adding 2 to each weight. This preserves the structure of the original graphs, so that edges
    with weight zero are not lost. (A non-edge is not the same as an edge with
    weight zero in this problem).

    old weight  new weight
    -1          1
    0           2
    1           3
    2           4

    So to obtain the original weights, subtract 2 from each entry.
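
    A minimal sketch of undoing the shift, assuming the stored edges have been loaded into a plain dictionary keyed by (row, column); that container and the sample entries are illustrative choices, not part of the collection's format.

```python
def restore_weights(shifted_edges):
    """Subtract 2 from every stored edge weight to recover the SNAP weights."""
    return {edge: weight - 2 for edge, weight in shifted_edges.items()}


# Illustrative stored entries covering the four shifted values 1, 2, 3, 4.
shifted = {(1, 2): 1, (2, 3): 2, (3, 4): 3, (4, 5): 4}
original = restore_weights(shifted)
```

    Only stored entries are touched, so an edge whose original weight is 0 stays in the dictionary as an explicit entry and remains distinct from a non-edge.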

    The primary sparse matrix for this problem is the as-caida20071105 matrix, or
    Problem.aux.G{121}, the second-to-the-last graph in the sequence.

    The nodes are uniform across all graphs in the sequence in the UF collection.
    That is, nodes do not come and go. A node that is "gone" simply has no edges. This is to allow comparisons across each node in the graphs.
    Problem.aux.nodenames gives the node numbers of the original problem. So
    row/column i in the matrix is always node number Problem.aux.nodenames(i) in
    all the graphs.

    Problem.aux.G{k} is the kth graph in the sequence.
    Problem.aux.Gname(k,:) is the name of the kth graph.

  20. SNAP Memetracker

    • kaggle.com
    zip
    Updated Nov 21, 2016
    Cite
    Stanford Network Analysis Project (2016). SNAP Memetracker [Dataset]. https://www.kaggle.com/snap/snap-memetracker
    Explore at:
    Available download formats: zip (922206150 bytes)
    Dataset updated
    Nov 21, 2016
    Dataset authored and provided by
    Stanford Network Analysis Project
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This database contains a subset of the Memetracker dataset collected by SNAP.

    The full Memetracker dataset has observations broken into months. Because of size considerations, however, this version consists of one-half of a month: the first 15 days of Memetracker observations from November 2008.

    About

    Memetracker tracks the quotes and phrases that appear most frequently over time across the entire online news spectrum. This makes it possible to see how different stories compete for news and blog coverage each day, and how certain stories persist while others fade quickly.

    Overall, Memetracker tracks more than 17 million different phrases. About 54% of the total phrase/quote mentions appear on blogs and 46% in news media.

    Acknowledgments

    This dataset was collected by the Stanford Network Analysis Project. Detailed information about the data and its analysis can be found at the website here.

    An analysis of this dataset was published here:
    J. Leskovec, L. Backstrom, J. Kleinberg. Meme-tracking and the Dynamics of the News Cycle. ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining, 2009.

    The Data

    The SQLite database contains three tables:

    articles: 4,542,920 records, with the following fields:

    • article_id: a unique id for the article (int)
    • url: the URL of the article (text)
    • date: the date of the article (text), in the strptime format '%Y-%m-%d %H:%M:%S'

    quotes: 7,956,125 records, with the following fields:

    • article_id: unique id for the article that this quote was found in (int)
    • phrase: the high-frequency phrase found in the article (text)

    links: 16,727,125 records, with the following fields:

    • article_id: unique id for the article that this link was found in (int)
    • link_out: the URL of the link out (text)
    • link_out_id: unique id for the target article (int), if it exists; else NULL
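
    The three tables can be queried with Python's built-in sqlite3 module. The sketch below recreates the described columns in an in-memory database with a few invented rows so that it runs on its own; the column types in the CREATE statements are inferred from the field list above, and the rows, URLs, and helper names are hypothetical. Against the real file, the connection would point at the downloaded database instead.

```python
import sqlite3
from datetime import datetime

# In-memory stand-in for the real database; types inferred from the field list.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE articles (article_id INTEGER, url TEXT, date TEXT);
CREATE TABLE quotes (article_id INTEGER, phrase TEXT);
CREATE TABLE links (article_id INTEGER, link_out TEXT, link_out_id INTEGER);
""")
conn.executemany("INSERT INTO articles VALUES (?, ?, ?)", [
    (1, "http://example.com/a", "2008-11-01 09:30:00"),
    (2, "http://example.com/b", "2008-11-02 18:05:00"),
])
conn.executemany("INSERT INTO quotes VALUES (?, ?)", [
    (1, "sample phrase"), (2, "sample phrase"), (2, "another phrase"),
])
conn.executemany("INSERT INTO links VALUES (?, ?, ?)", [
    (2, "http://example.com/a", 1),     # target article is in the database
    (2, "http://example.org/x", None),  # target article is not, so NULL
])


def first_seen(conn, phrase):
    """Earliest article date for a phrase, parsed with the documented format."""
    row = conn.execute(
        "SELECT MIN(a.date) FROM quotes q "
        "JOIN articles a ON a.article_id = q.article_id "
        "WHERE q.phrase = ?",
        (phrase,),
    ).fetchone()
    return datetime.strptime(row[0], "%Y-%m-%d %H:%M:%S") if row[0] else None


def resolved_link_count(conn):
    """Number of links whose target article exists (link_out_id is not NULL)."""
    return conn.execute(
        "SELECT COUNT(*) FROM links WHERE link_out_id IS NOT NULL"
    ).fetchone()[0]
```

    Because the dates are zero-padded text in year-first order, MIN on the text column agrees with chronological order, so no conversion is needed inside the query.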

Cite
Subhajit Sahu (2021). Location Graphs (SNAP) [Dataset]. https://www.kaggle.com/wolfram77/graphs-snap-loc

Location Graphs (SNAP)

Graphs from the Stanford Network Analysis Platform

Explore at:
Available download formats: zip (163822208 bytes)
Dataset updated
Dec 16, 2021
Authors
Subhajit Sahu
License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

loc-Brightkite

https://snap.stanford.edu/data/loc-Brightkite.html

Dataset information

Brightkite (http://www.brightkite.com/) was once a location-based social
networking service provider where users shared their locations by
checking-in. The friendship network was collected using their public API, and consists of 58,228 nodes and 214,078 edges. The network was originally
directed, but we have constructed an undirected network in which an edge is kept only when the friendship holds in both directions. We have also collected a total of 4,491,143
checkins of these users over the period of Apr. 2008 - Oct. 2010.

Dataset statistics
Nodes 58,228
Edges 214,078
Nodes in largest WCC 56739 (0.974)
Edges in largest WCC 212945 (0.995)
Nodes in largest SCC 56739 (0.974)
Edges in largest SCC 212945 (0.995)
Average clustering coefficient 0.1723
Number of triangles 494728
Fraction of closed triangles 0.03979
Diameter (longest shortest path) 16
90-percentile effective diameter 6
Checkins 4,491,143

Source (citation)
E. Cho, S. A. Myers, J. Leskovec. Friendship and Mobility: User Movement
in Location-Based Social Networks. ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining (KDD),
2011. http://cs.stanford.edu/people/jure/pubs/mobile-kdd11.pdf

Files
File Description
loc-brightkite_edges.txt.gz Friendship network of Brightkite users
loc-brightkite_totalCheckins.txt.gz Time and location information of check-ins made by users

Example of check-in information

[user] [check-in time] [latitude] [longitude] [location id]
58186 2008-12-03T21:09:14Z 39.633321 -105.317215 ee8b88dea22411    
58186 2008-11-30T22:30:12Z 39.633321 -105.317215 ee8b88dea22411    
58186 2008-11-28T17:55:04Z -13.158333 -72.531389 e6e86be2a22411    
58186 2008-11-26T17:08:25Z 39.633321 -105.317215 ee8b88dea22411    
58187 2008-08-14T21:23:55Z 41.257924 -95.938081 4c2af967eb5df8    
58187 2008-08-14T07:09:38Z 41.257924 -95.938081 4c2af967eb5df8    
58187 2008-08-14T07:08:59Z 41.295474 -95.999814 f3bb9560a2532e    
58187 2008-08-14T06:54:21Z 41.295474 -95.999814 f3bb9560a2532e    
58188 2010-04-06T06:45:19Z 46.521389  14.854444 ddaa40aaa22411    
58188 2008-12-30T15:30:08Z 46.522621  14.849618 58e12bc0d67e11    
58189 2009-04-08T07:36:46Z 46.554722  15.646667 ddaf9c4ea22411    
58190 2009-04-08T07:01:28Z 46.421389  15.869722 dd793f96a22411    
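
A line in this layout can be parsed with a few lines of Python. The sketch below assumes the five fields are separated by whitespace, as in the example above, and returns None for any line that does not have exactly five fields; the CheckIn type and function name are illustrative.

```python
from datetime import datetime, timezone
from typing import NamedTuple, Optional


class CheckIn(NamedTuple):
    user: int
    time: datetime
    latitude: float
    longitude: float
    location_id: str


def parse_checkin(line: str) -> Optional[CheckIn]:
    """Parse one check-in line; return None if it lacks the five fields."""
    fields = line.split()
    if len(fields) != 5:
        return None  # e.g. a line holding a user id but no other data
    user, stamp, lat, lon, location_id = fields
    # The trailing 'Z' marks UTC, so attach the UTC timezone explicitly.
    when = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    return CheckIn(int(user), when, float(lat), float(lon), location_id)


sample = "58186 2008-12-03T21:09:14Z 39.633321 -105.317215 ee8b88dea22411    "
record = parse_checkin(sample)
```

Splitting on any run of whitespace makes the parser indifferent to the padding and trailing spaces seen in the example rows.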

Notes on inclusion into the SuiteSparse Matrix Collection, July 2018:

The SNAP data set is 0-based, with nodes numbered 0 to 58,227.

In the SuiteSparse Matrix Collection the graph is converted to 1-based.
The Problem.A matrix is the undirected friendship network, where
A(i,j)=1 if person i-1 and person j-1 are friends in the SNAP data set.
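
The index shift can be illustrated with a small sketch. It assumes a 0-based SNAP edge list held as pairs of integers and stores the 1-based adjacency as a set of (i, j) pairs; the function names and the two sample edges are made up for illustration.

```python
def build_adjacency(snap_edges):
    """Build a symmetric, 1-based adjacency set from 0-based SNAP edges."""
    adjacency = set()
    for u, v in snap_edges:
        i, j = u + 1, v + 1  # SNAP id k maps to row/column k + 1
        adjacency.add((i, j))
        adjacency.add((j, i))  # undirected: store both directions
    return adjacency


def are_friends(adjacency, i, j):
    """True if the 1-based entry (i, j) is set."""
    return (i, j) in adjacency


# Made-up SNAP edges: user 0 is friends with users 1 and 2.
snap_edges = [(0, 1), (0, 2)]
A = build_adjacency(snap_edges)
```

With these sample edges, entry (1, 2) is set because SNAP users 0 and 1 are friends, while entry (2, 3) is not.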

There are 4,747,287 checkins in the loc-brightkite_totalCheckins.txt
file, but 6 lines are empty with a user id but no other data (those
are discarded here). In the SuiteSparse Matrix Collection, the checkin
data is held in 5 vectors of length 4,747,281. These are in the
Problem.aux component of the MATLAB struct. The kth entry of each of
these vectors holds the data in the kth line of the
loc-brightkite_totalCheckins.txt file (after deleting the 6 empty lines).

userid: the SNAP user id is an integer in the range 0 to 58,227. It
  has been incremented by one, here, to reflect the corresponding
  row and column of the Problem.A matrix. It contains 51,406
  unique user ids.
checkin_time: a string of length 20
latitude: a double precision number
longitude: a double precision number
location_id: a string of length 61.

loc-Gowalla

https://snap.stanford.edu/data/loc-Gowalla.html

Dataset information

Gowalla (http://www.gowalla.com/) is a location-based social networking
website where users share their locations by checking-in. The friendship
network is undirected and was collected using their public API, and
consists of 196,591 nodes and 950,327 edges. We have collected a total of
6,442,890 check-ins of these users over the period of Feb. 2009 - Oct.
2010.

Dataset statistics
Nodes 196,591
Edges 950,327
Nodes in largest WCC 196591 (1.000)
Edges in largest WCC 950327 (1.000)
Nodes in largest SCC 196591 (1.000)
Edges in largest SCC 950327 (1.000)
Average clustering coefficient 0.2367
Number of triangles 2273138
Fraction of closed triangles 0.007952
Diameter (longest shortest path) 14
90-percentile effective diameter 5.7
Check-ins 6,442,890

Source (citation)
E. Cho, S. A. Myers, J. Leskovec. Friendship and Mobility: User Movement
in Location-Based Social Networks. ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining (KDD),
2011. http://cs.stanford.edu/people/jure/pubs/mobile-kdd11.pdf

Files
File Description
loc-gowalla_edges.txt.gz Friendship network of Gowalla users
loc-gowalla_totalCheckins.txt.gz Time and location information of check-ins made by users

Example of check-in information

[user] [check-in time]   [latitude]  [longitude] [location id]  
196514 2010-07-24T13:45:06Z 53.3648119  -2.2723465833  145064   
196514 2010-07-24T13:44:58Z 53.360511233 -2.276369017  1275991   
196514 2010-07-24T13:44:46Z 53.3653895945 -2.2754087046  376497   
196514 2010-07-24T13:44:38Z 53.3663709833 -2.2700764333  98503    
196514 2010-07-24T13:44:26Z 53.3674087524 -2.2783813477  1043431   
196514 2010-07-24T13:44:08Z 53.3675663377 -2.278631763  881734   
196514 2010-07-24T13:43:18Z 53.3679640626 -2.2792943689  207763   
196514 2010-07-24T13:41:10Z 53.364905   -2.270824    1042822