100+ datasets found
  1. Yelp Dataset

    • paperswithcode.com
    Cite
    (2024). Yelp Dataset [Dataset]. https://paperswithcode.com/dataset/yelp
    Description

    The Yelp Dataset is a resource for academic research, teaching, and learning. It provides real-world data on businesses, reviews, and user interactions. Key figures:

    • Reviews: 6,990,280 reviews from users.
    • Businesses: information on 150,346 businesses.
    • Pictures: 200,100 pictures.
    • Metropolitan areas: data from 11 metropolitan areas.
    • Tips: 908,915 tips provided by 1,987,897 users.
    • Business attributes: more than 1.2 million business attributes such as hours, parking availability, and ambiance.
    • Aggregated check-ins: historical check-in data for each of 131,930 businesses.

  2. yelp_review_full

    • huggingface.co
    Updated Mar 6, 2012
    Cite
    Yelp (2012). yelp_review_full [Dataset]. https://huggingface.co/datasets/Yelp/yelp_review_full
    Dataset updated
    Mar 6, 2012
    Dataset authored and provided by
    Yelp (http://yelp.com/)
    License

    https://choosealicense.com/licenses/other/

    Description

    Dataset Card for YelpReviewFull

      Dataset Summary
    

    The Yelp reviews dataset consists of reviews from Yelp. It is extracted from the Yelp Dataset Challenge 2015 data.

      Supported Tasks and Leaderboards
    

    text-classification, sentiment-classification: The dataset is mainly used for text classification: given the text, predict the sentiment.

      Languages
    

    The reviews were mainly written in English.

      Dataset Structure
    
    
    
    
    
      Data Instances
    

    A… See the full description on the dataset page: https://huggingface.co/datasets/Yelp/yelp_review_full.

  3. Yelp dataset 2024

    • kaggle.com
    Updated Oct 29, 2024
    Cite
    snax07 (2024). Yelp dataset 2024 [Dataset]. https://www.kaggle.com/datasets/snax07/yelp-dataset-2024
    Dataset updated
    Oct 29, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    snax07
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Yelp Dataset JSON

    Each file is composed of a single object type, one JSON object per line.

    Take a look at some examples to get you started: https://github.com/Yelp/dataset-examples.

    Note: the following examples contain inline comments, which are technically not valid JSON. This is done to simplify the documentation and explain the structure; the JSON files you download will not contain any comments and will be fully valid JSON.

    business.json
    Contains business data including location data, attributes, and categories.

    {
    // string, 22 character unique string business id
    "business_id": "tnhfDv5Il8EaGSXZGiuQGg",

    // string, the business's name
    "name": "Garaje",
    
    // string, the full address of the business
    "address": "475 3rd St",
    
    // string, the city
    "city": "San Francisco",
    
    // string, 2 character state code, if applicable
    "state": "CA",
    
    // string, the postal code
    "postal code": "94107",
    
    // float, latitude
    "latitude": 37.7817529521,
    
    // float, longitude
    "longitude": -122.39612197,
    
    // float, star rating, rounded to half-stars
    "stars": 4.5,
    
    // integer, number of reviews
    "review_count": 1198,
    
    // integer, 0 or 1 for closed or open, respectively
    "is_open": 1,
    
    // object, business attributes to values. note: some attribute values might be objects
    "attributes": {
      "RestaurantsTakeOut": true,
      "BusinessParking": {
        "garage": false,
        "street": true,
        "validated": false,
        "lot": false,
        "valet": false
      }
    },
    
    // an array of strings of business categories
    "categories": [
      "Mexican",
      "Burgers",
      "Gastropubs"
    ],
    
    // an object of key day to value hours, hours are using a 24hr clock
    "hours": {
      "Monday": "10:00-21:00",
      "Tuesday": "10:00-21:00",
      "Friday": "10:00-21:00",
      "Wednesday": "10:00-21:00",
      "Thursday": "10:00-21:00",
      "Sunday": "11:00-18:00",
      "Saturday": "10:00-21:00"
    }
    

    }

    review.json
    Contains full review text data including the user_id that wrote the review and the business_id the review is written for.

    {
    // string, 22 character unique review id
    "review_id": "zdSx_SD6obEhz9VrW9uAWA",

    // string, 22 character unique user id, maps to the user in user.json
    "user_id": "Ha3iJu77CxlrFm-vQRs_8g",
    
    // string, 22 character business id, maps to business in business.json
    "business_id": "tnhfDv5Il8EaGSXZGiuQGg",
    
    // integer, star rating
    "stars": 4,
    
    // string, date formatted YYYY-MM-DD
    "date": "2016-03-09",
    
    // string, the review itself
    "text": "Great place to hang out after work: the prices are decent, and the ambience is fun. It's a bit loud, but very lively. The staff is friendly, and the food is good. They have a good selection of drinks.",
    
    // integer, number of useful votes received
    "useful": 0,
    
    // integer, number of funny votes received
    "funny": 0,
    
    // integer, number of cool votes received
    "cool": 0
    

    }

    user.json
    User data including the user's friend mapping and all the metadata associated with the user.

    {
    // string, 22 character unique user id, maps to the user in user.json
    "user_id": "Ha3iJu77CxlrFm-vQRs_8g",

    // string, the user's first name
    "name": "Sebastien",
    
    // integer, the number of reviews they've written
    "review_count": 56,
    
    // string, when the user joined Yelp, formatted like YYYY-MM-DD
    "yelping_since": "2011-01-01",
    
    // array of strings, an array of the user's friend as user_ids
    "friends": [
      "wqoXYLWmpkEH0YvTmHBsJQ",
      "KUXLLiJGrjtSsapmxmpvTA",
      "6e9rJKQC3n0RSKyHLViL-Q"
    ],
    
    // integer, number of useful votes sent by the user
    "useful": 21,
    
    // integer, number of funny votes sent by the user
    "funny": 88,
    
    // integer, number of cool votes sent by the user
    "cool": 15,
    
    // integer, number of fans the user has
    "fans": 1032,
    
    // array of integers, the years the user was elite
    "elite": [
      2012,
      2013
    ],
    
    // float, average rating of all reviews
    "average_stars": 4.31,
    
    // integer, number of hot compliments received by the user
    "compliment_hot": 339,
    
    // integer, number of more compliments received by the user
    "compliment_more": 668,
    
    // integer, number of profile compliments received by the user
    "compliment_profile": 42,
    
    // integer, number of cute compliments received by the user
    "compliment_cute": 62,
    
    // integer, number of list compliments received by the user
    "compliment_list": 37,
    
    // integer, number of note compliments received by the user
    "compliment_note": 356,
    
    // integer, number of plain compliments received by the user
    "compliment_plain": 68,
    
    // integer, number of coo...
    
  4. yelp-dataset

    • huggingface.co
    Updated Apr 17, 2024
    + more versions
    Cite
    Noah Nkalubo Nsimbe (2024). yelp-dataset [Dataset]. https://huggingface.co/datasets/noahnsimbe/yelp-dataset
    Dataset updated
    Apr 17, 2024
    Authors
    Noah Nkalubo Nsimbe
    License

    https://choosealicense.com/licenses/other/

    Description

    Dataset Card for Dataset Name

    This dataset card aims to be a base template for new datasets. It has been generated using this raw template.

      Dataset Details
    
    
    
    
    
      Dataset Description
    

    Curated by: [More Information Needed] Funded by [optional]: [More Information Needed] Shared by [optional]: [More Information Needed] Language(s) (NLP): [More Information Needed] License: [More Information Needed]

      Dataset Sources [optional]
    

    Repository: [More… See the full description on the dataset page: https://huggingface.co/datasets/noahnsimbe/yelp-dataset.

  5. Yelp Datasets

    • brightdata.com
    .json, .csv, .xlsx
    Cite
    Bright Data, Yelp Datasets [Dataset]. https://brightdata.com/products/datasets/yelp
    Available download formats: .json, .csv, .xlsx
    Dataset authored and provided by
    Bright Data (https://brightdata.com/)
    License

    https://brightdata.com/license

    Area covered
    Worldwide
    Description

    Use our Yelp dataset to discover and review local businesses, such as restaurants, bars, cafes, hotels, and more. The Yelp dataset is a complementary dataset to the Yelp businesses overview and includes full information on each review submitted for a business. Datapoints include: timestamp, business_id, review_author, rating, date, content, review_image, reactions, replies and more.

  6. Yelp-Fraud Dataset

    • paperswithcode.com
    Updated Apr 21, 2025
    Cite
    Yingtong Dou; Zhiwei Liu; Li Sun; Yutong Deng; Hao Peng; Philip S. Yu (2025). Yelp-Fraud Dataset [Dataset]. https://paperswithcode.com/dataset/yelpchi
    Dataset updated
    Apr 21, 2025
    Authors
    Yingtong Dou; Zhiwei Liu; Li Sun; Yutong Deng; Hao Peng; Philip S. Yu
    Description

    Yelp-Fraud is a multi-relational graph dataset built upon the Yelp spam review dataset, which can be used in evaluating graph-based node classification, fraud detection, and anomaly detection models.

    Dataset Statistics

    # Nodes: 45,954
    % Fraud Nodes (Class=1): 14.5

    Relation    # Edges
    R-U-R       (not shown)
    R-T-R       (not shown)
    R-S-R       3,402,743
    All         (not shown)

    Graph Construction

    The Yelp spam review dataset includes hotel and restaurant reviews filtered (spam) and recommended (legitimate) by Yelp. We conduct a spam review detection task on the Yelp-Fraud dataset which is a binary classification task. We take 32 handcrafted features from SpEagle paper as the raw node features for Yelp-Fraud. Based on previous studies which show that opinion fraudsters have connections in user, product, review text, and time, we take reviews as nodes in the graph and design three relations: 1) R-U-R: it connects reviews posted by the same user; 2) R-S-R: it connects reviews under the same product with the same star rating (1-5 stars); 3) R-T-R: it connects two reviews under the same product posted in the same month.
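    The three relations described above can be sketched in a few lines. This is a minimal illustration, not the authors' construction code: it assumes each review is a dict with the hypothetical keys `user`, `product`, `stars`, and `month`, and it connects every pair of reviews that share the grouping key of a relation.

```python
from collections import defaultdict
from itertools import combinations


def build_relation_edges(reviews):
    """Return {relation_name: set of (i, j) review-index pairs with i < j}.

    `reviews` is a list of dicts with the hypothetical keys
    "user", "product", "stars" and "month" (e.g. "2012-07").
    """
    # Grouping key for each relation described in the text.
    key_funcs = {
        "R-U-R": lambda r: r["user"],                   # same user
        "R-S-R": lambda r: (r["product"], r["stars"]),  # same product, same star rating
        "R-T-R": lambda r: (r["product"], r["month"]),  # same product, same month
    }
    edges = {}
    for name, key in key_funcs.items():
        groups = defaultdict(list)
        for idx, review in enumerate(reviews):
            groups[key(review)].append(idx)
        # Every pair of reviews inside one group is connected.
        edges[name] = {pair for members in groups.values()
                       for pair in combinations(members, 2)}
    return edges


# Toy data, made up for illustration only.
reviews = [
    {"user": "u1", "product": "p1", "stars": 5, "month": "2012-07"},
    {"user": "u1", "product": "p2", "stars": 1, "month": "2012-07"},
    {"user": "u2", "product": "p1", "stars": 5, "month": "2012-08"},
    {"user": "u3", "product": "p1", "stars": 2, "month": "2012-08"},
]
edges = build_relation_edges(reviews)
```

    Grouping first and pairing within each group avoids comparing every review with every other review, although large groups can still produce many edges.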

    To download the dataset, please visit the GitHub repo. For any other questions, please email ytongdou(AT)gmail.com.

  7. Yelp Reviews Dataset

    • brightdata.com
    .json, .csv, .xlsx
    Updated Nov 25, 2024
    Cite
    Bright Data (2024). Yelp Reviews Dataset [Dataset]. https://brightdata.com/products/datasets/yelp/reviews
    Available download formats: .json, .csv, .xlsx
    Dataset updated
    Nov 25, 2024
    Dataset authored and provided by
    Bright Data (https://brightdata.com/)
    License

    https://brightdata.com/license

    Area covered
    Worldwide
    Description

    Yelp Reviews dataset to explore ratings and reviews for local businesses, including restaurants, bars, cafes, and hotels. Popular use cases include analyzing customer sentiment, benchmarking business performance, and gaining insights into local market trends. Datapoints include: business ID, review author, rating, date, content, image, and more.

  8. yelp_polarity_reviews

    • tensorflow.org
    Updated Dec 6, 2022
    + more versions
    Cite
    (2022). yelp_polarity_reviews [Dataset]. https://www.tensorflow.org/datasets/catalog/yelp_polarity_reviews
    Dataset updated
    Dec 6, 2022
    Description

    Large Yelp Review Dataset. This is a dataset for binary sentiment classification. We provide a set of 560,000 highly polar Yelp reviews for training, and 38,000 for testing.

    ORIGIN

    The Yelp reviews dataset consists of reviews from Yelp. It is extracted from the Yelp Dataset Challenge 2015 data. For more information, please refer to http://www.yelp.com/dataset

    The Yelp reviews polarity dataset is constructed by Xiang Zhang (xiang.zhang@nyu.edu) from the above dataset. It is first used as a text classification benchmark in the following paper: Xiang Zhang, Junbo Zhao, Yann LeCun. Character-level Convolutional Networks for Text Classification. Advances in Neural Information Processing Systems 28 (NIPS 2015).

    DESCRIPTION

    The Yelp reviews polarity dataset is constructed by considering stars 1 and 2 negative, and 3 and 4 positive. For each polarity, 280,000 training samples and 19,000 testing samples are taken randomly. In total there are 560,000 training samples and 38,000 testing samples. Negative polarity is class 1, and positive polarity is class 2.

    The files train.csv and test.csv contain the training and testing samples as comma-separated values. There are 2 columns in them, corresponding to class index (1 and 2) and review text. The review texts are escaped using double quotes ("), and any internal double quote is escaped by 2 double quotes (""). New lines are escaped by a backslash followed by an "n" character, that is "\n".
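    The escaping rules above map directly onto Python's standard csv module. The following is a minimal sketch, assuming rows of exactly two columns as described; the function name `read_polarity_rows` and the sample rows are made up for illustration.

```python
import csv
import io


def read_polarity_rows(lines):
    """Yield (class_index, review_text) pairs from train.csv / test.csv style rows."""
    # csv.reader already handles the quoting rule: fields wrapped in double
    # quotes, with "" standing for a literal double quote.
    for class_index, text in csv.reader(lines):
        # New lines are stored as a backslash followed by "n"; restore them.
        yield int(class_index), text.replace("\\n", "\n")


# Two made-up rows in the described format.
sample = io.StringIO('"1","The waiter said ""no"" twice.\\nNever again."\n'
                     '"2","Great tacos."\n')
rows = list(read_polarity_rows(sample))
```

    For the real files, an open file handle (opened with newline="") can be passed in place of the StringIO object.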

    To use this dataset:

    import tensorflow_datasets as tfds

    # Load the training split and print the first four examples.
    ds = tfds.load('yelp_polarity_reviews', split='train')
    for ex in ds.take(4):
        print(ex)
    

    See the guide for more information on tensorflow_datasets.

  9. yelp-csv

    • kaggle.com
    Updated Jan 30, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Flyer Steve (2023). yelp-csv [Dataset]. https://www.kaggle.com/datasets/flyersteve/yelp-csv/suggestions
    Dataset updated
    Jan 30, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Flyer Steve
    Description

    Dataset

    This dataset was created by Flyer Steve

    Contents

  10. Louisville Metro KY - YELP Data businesses

    • datasets.ai
    • catalog.data.gov
    15, 21, 3, 8
    Updated Sep 11, 2024
    + more versions
    Cite
    Louisville Metro Government (2024). Louisville Metro KY - YELP Data businesses [Dataset]. https://datasets.ai/datasets/louisville-metro-ky-yelp-data-businesses
    Available download formats: 15, 21, 8, 3
    Dataset updated
    Sep 11, 2024
    Dataset authored and provided by
    Louisville Metro Government
    Area covered
    Louisville, Kentucky
    Description

    Listing of geocoded businesses, inspections for those businesses, and health violations for those businesses, used as a feed to Yelp. All files are csv files.

    Data Dictionary Type

    Contact:

    Gerald Kaforski

    gerald.kaforski@louisvilleky.gov

  11. Yelp reviews - Full

    • academictorrents.com
    bittorrent
    Updated Oct 16, 2018
    Cite
    Xiang Zhang et al., 2015 (2018). Yelp reviews - Full [Dataset]. https://academictorrents.com/details/66ab083bda0c508de6c641baabb1ec17f72dc480
    Available download formats: bittorrent (196146755)
    Dataset updated
    Oct 16, 2018
    Dataset authored and provided by
    Xiang Zhang et al., 2015
    License

    https://academictorrents.com/nolicensespecified

    Description

    1,569,264 samples from the Yelp Dataset Challenge 2015. This full dataset has 130,000 training samples and 10,000 testing samples in each star.
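    The per-star figures imply the overall split sizes. The short calculation below assumes five star classes (1 to 5 stars) and reads the 1,569,264 figure as the size of the pool the samples were drawn from, which is an interpretation rather than something the description states.

```python
STAR_CLASSES = 5            # assumption: reviews are labelled 1 to 5 stars
TRAIN_PER_STAR = 130_000    # from the description above
TEST_PER_STAR = 10_000      # from the description above

train_total = STAR_CLASSES * TRAIN_PER_STAR
test_total = STAR_CLASSES * TEST_PER_STAR
print(train_total, test_total)  # 650000 50000
```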

  12. Yelp Dataset

    • kaggle.com
    zip
    Updated Mar 17, 2022
    + more versions
    Cite
    Yelp, Inc. (2022). Yelp Dataset [Dataset]. https://www.kaggle.com/yelp-dataset/yelp-dataset
    Available download formats: zip (4374983563 bytes)
    Dataset updated
    Mar 17, 2022
    Dataset provided by
    Yelp (http://yelp.com/)
    Authors
    Yelp, Inc.
    Description

    Context

    This dataset is a subset of Yelp's businesses, reviews, and user data. It was originally put together for the Yelp Dataset Challenge which is a chance for students to conduct research or analysis on Yelp's data and share their discoveries. In the most recent dataset you'll find information about businesses across 8 metropolitan areas in the USA and Canada.

    Content

    This dataset contains five JSON files and the user agreement. More information about those files can be found here.

    Code snippet to read the files

    In Python, you can read the JSON files like this (using the json and pandas libraries):

    import json
    import pandas as pd

    # Each line of the file is one JSON object, so parse it line by line.
    with open("yelp_academic_dataset_checkin.json", encoding="utf-8") as data_file:
        data = [json.loads(line) for line in data_file]
    checkin_df = pd.DataFrame(data)
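    Because the archive is several gigabytes, collecting every record in a list may not fit in memory for the larger files. The sketch below is one possible alternative that uses only the standard library and yields records one at a time; the helper names `iter_json_lines` and `count_records` are hypothetical, not part of any Yelp tooling.

```python
import json


def iter_json_lines(path, encoding="utf-8"):
    """Yield one parsed object per non-empty line of a JSON-lines file."""
    with open(path, encoding=encoding) as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def count_records(path):
    """Count records without holding them all in memory."""
    return sum(1 for _ in iter_json_lines(path))
```

    A call such as count_records("yelp_academic_dataset_checkin.json") would then scan the file once; the same generator can feed a filter or a chunked DataFrame build.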
    
    
  13. yelp

    • huggingface.co
    Cite
    MultifacetedNLPDatasets, yelp [Dataset]. https://huggingface.co/datasets/recmeapp/yelp
    Authors
    MultifacetedNLPDatasets
    Description

    A quick usage example of Yelp dataset.

      install datasets library
    

    %pip install datasets

      import load_dataset
    

    from datasets import load_dataset

      Reading the Dataset
    

    ds = load_dataset("recmeapp/yelp", "main_data")

      Reading the App MetaData
    

    app_metadata = load_dataset("recmeapp/yelp", "app_meta")

      How many dialogs are there in different splits?
    

    train_data = ds['train']
    valid_data = ds['val']
    test_data = ds['test']

    print(f'There are… See the full description on the dataset page: https://huggingface.co/datasets/recmeapp/yelp.

  14. Yelp Open Dataset

    • live.european-language-grid.eu
    json
    Updated Dec 30, 2015
    Cite
    Yelp (2015). Yelp Open Dataset [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/5179
    Available download formats: json
    Dataset updated
    Dec 30, 2015
    Dataset authored and provided by
    Yelp (http://yelp.com/)
    License

    https://s3-media0.fl.yelpcdn.com/assets/srv0/engineering_pages/bea5c1e92bf3/assets/vendor/yelp-dataset-agreement.pdf

    Description

    Dataset containing millions of reviews on Yelp. In addition it contains business data including location data, attributes, and categories.

  15. Yelp 2015

    • figshare.com
    txt
    Updated May 21, 2018
    + more versions
    Cite
    Zeping Yu (2018). Yelp 2015 [Dataset]. http://doi.org/10.6084/m9.figshare.6292334.v1
    Available download formats: txt
    Dataset updated
    May 21, 2018
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Zeping Yu
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This dataset is a subset of the Yelp Challenge data; it contains all the reviews from 2015.

  16. Yelp reviews - Polarity

    • academictorrents.com
    bittorrent
    Updated Oct 16, 2018
    Cite
    Xiang Zhang et al., 2015 (2018). Yelp reviews - Polarity [Dataset]. https://academictorrents.com/details/271777225ff3c6dec8055e231c70731a1da2518f
    Available download formats: bittorrent (166373201)
    Dataset updated
    Oct 16, 2018
    Dataset authored and provided by
    Xiang Zhang et al., 2015
    License

    https://academictorrents.com/nolicensespecified

    Description

    1,569,264 samples from the Yelp Dataset Challenge 2015. This subset has 280,000 training samples and 19,000 test samples in each polarity.

  17. The Yelp Collaborative Knowledge Graph

    • data.niaid.nih.gov
    Updated Jun 17, 2023
    Cite
    Olesen, Magnus (2023). The Yelp Collaborative Knowledge Graph [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7878446
    Dataset updated
    Jun 17, 2023
    Dataset provided by
    Olesen, Magnus
    Corfixen, Mads
    Heede, Thomas
    Nielsen, Christian Filip Pinderup
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This is the Yelp Collaborative Knowledge Graph (YCKG), a transformation of the Yelp Open Dataset into RDF format using Y2KG.

    Paper Abstract

    The Yelp Open Dataset (YOD) contains data about businesses, reviews, and users from the Yelp website and is available for research purposes. This dataset has been widely used to develop and test Recommender Systems (RS), especially those using Knowledge Graphs (KGs), e.g., integrating taxonomies, product categories, business locations, and social network information. Unfortunately, researchers applied naive or wrong mappings while converting YOD in KGs, consequently obtaining unrealistic results. Among the various issues, the conversion processes usually do not follow state-of-the-art methodologies, fail to properly link to other KGs and reuse existing vocabularies. In this work, we overcome these issues by introducing Y2KG, a utility to convert the Yelp dataset into a KG. Y2KG consists of two components. The first is a dataset including (1) a vocabulary that extends Schema.org with properties to describe the concepts in YOD and (2) mappings between the Yelp entities and Wikidata. The second component is a set of scripts to transform YOD in RDF and obtain the Yelp Collaborative Knowledge Graph (YCKG). The design of Y2KG was driven by 16 core competency questions. YCKG includes 150k businesses and 16.9M reviews from 1.9M distinct real users, resulting in over 244 million triples (with 144 distinct predicates) for about 72 million resources, with an average in-degree and out-degree of 3.3 and 12.2, respectively.

    Links

    Latest GitHub release: https://github.com/MadsCorfixen/The-Yelp-Collaborative-Knowledge-Graph/releases/latest

    PURL domain: https://purl.archive.org/domain/yckg

    Files

    Graph Data Triple Files

    One sample file for each of the Yelp domains (Businesses, Users, Reviews, Tips and Checkins), each containing 20 entities.

    yelp_schema_mappings.nt.gz containing the mappings from Yelp categories to Schema things.

    schema_hierarchy.nt.gz containing the full hierarchy of the mapped Schema things.

    yelp_wiki_mappings.nt.gz containing the mappings from Yelp categories to Wikidata entities.

    wikidata_location_mappings.nt.gz containing the mappings from Yelp locations to Wikidata entities.
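    The .nt.gz files listed above are gzip-compressed N-Triples, a line-based RDF format with one triple per line. As a rough way to inspect such a file without an RDF library, the sketch below counts how often each predicate occurs. The function name `count_predicates` is hypothetical, and the parsing is deliberately naive: it relies on the fact that subjects and predicates in N-Triples contain no whitespace.

```python
import gzip
from collections import Counter


def count_predicates(path):
    """Count how often each predicate occurs in a gzipped N-Triples file."""
    counts = Counter()
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blank lines and comments
            # Subject and predicate never contain whitespace in N-Triples,
            # so the first two whitespace-separated tokens are safe to take.
            _subject, predicate, _rest = line.split(None, 2)
            counts[predicate] += 1
    return counts
```

    Calling count_predicates("yelp_schema_mappings.nt.gz") and then .most_common(10) on the result would list the ten most frequent predicates; for anything beyond inspection, a full RDF parser is the safer choice.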

    Graph Metadata Triple Files

    yelp_categories.ttl contains metadata for all Yelp categories.

    yelp_entities.ttl contains metadata regarding the dataset

    yelp_vocabulary.ttl contains metadata on the created Yelp vocabulary and properties.

    Utility Files

    yelp_category_schema_mappings.csv. This file contains the 310 mappings from Yelp categories to Schema types. These mappings have been manually verified to be correct.

    yelp_predicate_schema_mappings.csv. This file contains the 14 mappings from Yelp attributes to Schema properties. These mappings are manually found.

    ground_truth_yelp_category_schema_mappings.csv. This file contains the ground truth, based on 200 manually verified mappings from Yelp categories to Schema things. The ground truth mappings were used to calculate precision and recall for the semantic mappings.

    manually_split_categories.csv. This file contains all Yelp categories containing either a & or /, and their manually split versions. The split versions have been used in the semantic mappings to Schema things.

  18. Replication Data for: "A Topic-based Segmentation Model for Identifying...

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Sep 25, 2024
    Cite
    Kim, Sunghoon; Lee, Sanghak; McCulloch, Robert (2024). Replication Data for: "A Topic-based Segmentation Model for Identifying Segment-Level Drivers of Star Ratings from Unstructured Text Reviews" [Dataset]. http://doi.org/10.7910/DVN/EE3DE2
    Dataset updated
    Sep 25, 2024
    Dataset provided by
    Harvard Dataverse
    Authors
    Kim, Sunghoon; Lee, Sanghak; McCulloch, Robert
    Description

    We provide instructions, code, and datasets for replicating the article by Kim, Lee and McCulloch (2024), "A Topic-based Segmentation Model for Identifying Segment-Level Drivers of Star Ratings from Unstructured Text Reviews." This repository provides a user-friendly R package for researchers or practitioners to apply A Topic-based Segmentation Model with Unstructured Texts (latent class regression with group variable selection) to their datasets.

    First, we provide R code to replicate the illustrative simulation study: see file 1.
    Second, we provide the user-friendly R package with a very simple example code to help apply the model to real-world datasets: see file 2, Package_MixtureRegression_GroupVariableSelection.R and Dendrogram.R.
    Third, we provide a set of codes and instructions to replicate the empirical studies of customer-level segmentation and restaurant-level segmentation with Yelp reviews data: see files 3-a, 3-b, 4-a, 4-b. Note: due to the dataset terms of use by Yelp and the restriction of data size, we provide the link to download the same Yelp datasets (https://www.kaggle.com/datasets/yelp-dataset/yelp-dataset/versions/6).
    Fourth, we provide a set of codes and datasets to replicate the empirical study with professor ratings reviews data: see file 5.
    Please see more details in the description text and comments of each file.

    [A guide on how to use the code to reproduce each study in the paper]

    1. Full codes for replicating Illustrative simulation study.txt -- [see Table 2 and Figure 2 in main text]: This is R source code to replicate the illustrative simulation study. Please run from the beginning to the end in R. In addition to estimated coefficients (posterior means of coefficients), indicators of variable selections, and segment memberships, you will get dendrograms of selected groups of variables in Figure 2. Computing time is approximately 20 to 30 minutes.

    3-a. Preprocessing raw Yelp Reviews for Customer-level Segmentation.txt: Code for preprocessing the downloaded unstructured Yelp review data and preparing DV and IVs matrix for customer-level segmentation study.

    3-b. Instruction for replicating Customer-level Segmentation analysis.txt -- [see Table 10 in main text; Tables F-1, F-2, and F-3 and Figure F-1 in Web Appendix]: Code for replicating customer-level segmentation study with Yelp data. You will get estimated coefficients (posterior means of coefficients), indicators of variable selections, and segment memberships. Computing time is approximately 3 to 4 hours.

    4-a. Preprocessing raw Yelp reviews_Restaruant Segmentation (1).txt: R code for preprocessing the downloaded unstructured Yelp data and preparing DV and IVs matrix for restaurant-level segmentation study.

    4-b. Instructions for replicating restaurant-level segmentation analysis.txt -- [see Tables 5, 6 and 7 in main text; Tables E-4 and E-5 and Figure H-1 in Web Appendix]: Code for replicating restaurant-level segmentation study with Yelp. You will get estimated coefficients (posterior means of coefficients), indicators of variable selections, and segment memberships. Computing time is approximately 10 to 12 hours.

    [Guidelines for running Benchmark models in Table 6]

    • Unsupervised Topic model: 'topicmodels' package in R -- after determining the number of topics (e.g., with 'ldatuning' R package), run 'LDA' function in the 'topicmodels' package. Then, compute topic probabilities per restaurant (with 'posterior' function in the package), which can be used as predictors. Then, conduct prediction with regression.
    • Hierarchical topic model (HDP): 'gensimr' R package -- 'model_hdp' function for identifying topics in the package (see https://radimrehurek.com/gensim/models/hdpmodel.html or https://gensimr.news-r.org/).
    • Supervised topic model: 'lda' R package -- 'slda.em' function for training and 'slda.predict' for prediction.
    • Aggregate regression: 'lm' default function in R.
    • Latent class regression without variable selection: 'flexmix' function in 'flexmix' R package. Run flexmix with a certain number of segments (e.g., 3 segments in this study). Then, with estimated coefficients and memberships, conduct prediction of dependent variable per each segment.
    • Latent class regression with variable selection: 'Unconstraind_Bayes_Mixture' function in Kim, Fong and DeSarbo (2012)'s package. Run the Kim et al. (2012) model with a certain number of segments (e.g., 3 segments in this study). Then, with estimated coefficients and memberships, we can do prediction of dependent variables per each segment. The same R package ('KimFongDeSarbo2012.zip') can be downloaded at: https://sites.google.com/scarletmail.rutgers.edu/r-code-packages/home

    5. Instructions for replicating Professor ratings review study.txt -- [see Tables G-1, G-2, G-4 and G-5, and Figures G-1 and H-2 in Web Appendix]: Code to replicate the Professor ratings reviews study. Computing time is approximately 10 hours.

    [A list of the versions of R, packages, and computer...

  19. Yelp Reviews in Boston, MA

    • dataverse.harvard.edu
    Updated Oct 12, 2020
    Cite
    Qiliang Chen; Riley Tucker; Babak Heydari; Daniel T. O'Brien (2020). Yelp Reviews in Boston, MA [Dataset]. http://doi.org/10.7910/DVN/DMWCBT
    Dataset updated
    Oct 12, 2020
    Dataset provided by
    Harvard Dataverse
    Authors
    Qiliang Chen; Riley Tucker; Babak Heydari; Daniel T. O'Brien
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Area covered
    Massachusetts, Boston
    Description

    These datasets include information about Yelp restaurant reviews for the city of Boston processed from data scraped by BARI. We have generated a list of Boston restaurants by searching all of Boston's zipcodes on Yelp and then verifying that each identified restaurant has an address that falls within Boston's boundaries. YELP.Reviews is a review-level file that contains information about reviews posted on Yelp. YELP.Restaurants is a restaurant-level file that contains information about the restaurants on Yelp. Restaurant data has been aggregated across census tracts to generate YELP.CT, which includes ecometrics that describe neighborhoods in terms of frequency of reviews.
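    The tract-level aggregation described above can be sketched roughly as follows. This is only an illustration, not BARI's actual processing: the column names (`review_id`, `restaurant_id`, `census_tract`) and the in-memory rows are hypothetical stand-ins for the YELP.Reviews file, and the real YELP.CT ecometrics may be defined differently.

```python
import pandas as pd

# Hypothetical review-level rows (one row per review); column names are assumptions.
reviews = pd.DataFrame(
    {
        "review_id": ["r1", "r2", "r3", "r4", "r5", "r6"],
        "restaurant_id": ["a", "a", "b", "c", "c", "c"],
        "census_tract": ["T1", "T1", "T1", "T2", "T2", "T2"],
    }
)

# Aggregate reviews to the census-tract level: review counts and distinct restaurants.
tract_summary = (
    reviews.groupby("census_tract")
    .agg(
        n_reviews=("review_id", "size"),
        n_restaurants=("restaurant_id", "nunique"),
    )
    .reset_index()
)

# A simple review-frequency measure per tract (illustrative only).
tract_summary["reviews_per_restaurant"] = (
    tract_summary["n_reviews"] / tract_summary["n_restaurants"]
)
```

    With the real files, `reviews` would be read from the downloaded review-level file, using whatever tract identifier column it actually provides.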

  20. o

    Same Sentiment Classification Train/Dev/Test Pair IDs

    • explore.openaire.eu
    • data.niaid.nih.gov
    • +1 more
    Updated Sep 9, 2021
    Cite
    Erik Körner; Ahmad Dawar Hakimi; Gerhard Heyer; Martin Potthast (2021). Same Sentiment Classification Train/Dev/Test Pair IDs [Dataset]. http://doi.org/10.5281/zenodo.5495793
    Dataset updated
    Sep 9, 2021
    Authors
    Erik Körner; Ahmad Dawar Hakimi; Gerhard Heyer; Martin Potthast
    Description

    This "dataset" only includes the compiled pairings of the Yelp Business Review Dataset. To get access to the actual review texts, please follow the instructions on the Yelp Dataset webpage. The data format is JSON Lines.

    Python load example:

        import pandas as pd

        traindev_df = pd.read_json("df_traindev.jsonl", lines=True)
        test_df = pd.read_json("df_test.jsonl", lines=True)

        # example access to single business/review id
        s1_bid = test_df.iloc[0]["sent1_business_id"]
        s1_rid = test_df.iloc[0]["sent1_review_id"]
        s2_bid = test_df.iloc[0]["sent2_business_id"]
        s2_rid = test_df.iloc[0]["sent2_review_id"]
        label = test_df.iloc[0]["is_same_side"]

    See documentation at:
    • Yelp Dataset Schemata (only business.json and review.json were used)
    • Yelp Business Category Hierarchy (download the json file as all_category_list.json)

    For details on how the data was compiled and used in our experiments, please refer to our code repository. Other derived data splits can be reproduced deterministically by using the same random seed as in our experiments.
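    Because the pair files carry only IDs, the review texts have to be joined in from the Yelp Dataset itself. The following is a minimal sketch of such a join, not the authors' pipeline. It assumes the review data has `review_id` and `text` fields (as in the public Yelp review schema); the helper name `attach_review_texts` and the tiny in-memory frames are hypothetical.

```python
import pandas as pd


def attach_review_texts(pairs: pd.DataFrame, reviews: pd.DataFrame) -> pd.DataFrame:
    """Add sent1_text / sent2_text columns by looking up review IDs.

    `reviews` is assumed to have `review_id` and `text` columns.
    """
    # Build a review_id -> text lookup once and reuse it for both sides of each pair.
    text_by_id = reviews.drop_duplicates("review_id").set_index("review_id")["text"]
    out = pairs.copy()
    out["sent1_text"] = out["sent1_review_id"].map(text_by_id)
    out["sent2_text"] = out["sent2_review_id"].map(text_by_id)
    return out


# Tiny in-memory stand-ins so the sketch runs without the real files.
reviews = pd.DataFrame(
    {
        "review_id": ["r1", "r2", "r3"],
        "text": ["Great food.", "Slow service.", "Loved it."],
    }
)
pairs = pd.DataFrame(
    {
        "sent1_review_id": ["r1", "r2"],
        "sent2_review_id": ["r3", "r9"],  # r9 is deliberately absent from `reviews`
        "is_same_side": [True, False],
    }
)
paired = attach_review_texts(pairs, reviews)
```

    With the real data, `pairs` would be `test_df` or `traindev_df` from the load example, and `reviews` would be read from the downloaded review file with `pd.read_json(..., lines=True)`. IDs that are missing from the review file end up as NaN rather than raising an error, so it is worth checking for missing texts after the join.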


Yelp Dataset

6 scholarly articles cite this dataset (View in Google Scholar)
