6 datasets found

d
Replication Data for: \"A Topic-based Segmentation Model for Identifying...
search.dataone.org
dataverse.harvard.edu
Updated Sep 25, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Kim, Sunghoon; Lee, Sanghak; McCulloch, Robert (2024). Replication Data for: \"A Topic-based Segmentation Model for Identifying Segment-Level Drivers of Star Ratings from Unstructured Text Reviews\" [Dataset]. http://doi.org/10.7910/DVN/EE3DE2
Explore at:
Unique identifier
https://doi.org/10.7910/DVN/EE3DE2
Dataset updated
Sep 25, 2024
Dataset provided by
Harvard Dataverse
Authors
Kim, Sunghoon; Lee, Sanghak; McCulloch, Robert
Description
We provide instructions, codes and datasets for replicating the article by Kim, Lee and McCulloch (2024), "A Topic-based Segmentation Model for Identifying Segment-Level Drivers of Star Ratings from Unstructured Text Reviews." This repository provides a user-friendly R package for any researchers or practitioners to apply A Topic-based Segmentation Model with Unstructured Texts (latent class regression with group variable selection) to their datasets. First, we provide a R code to replicate the illustrative simulation study: see file 1. Second, we provide the user-friendly R package with a very simple example code to help apply the model to real-world datasets: see file 2, Package_MixtureRegression_GroupVariableSelection.R and Dendrogram.R. Third, we provide a set of codes and instructions to replicate the empirical studies of customer-level segmentation and restaurant-level segmentation with Yelp reviews data: see files 3-a, 3-b, 4-a, 4-b. Note, due to the dataset terms of use by Yelp and the restriction of data size, we provide the link to download the same Yelp datasets (https://www.kaggle.com/datasets/yelp-dataset/yelp-dataset/versions/6). Fourth, we provided a set of codes and datasets to replicate the empirical study with professor ratings reviews data: see file 5. Please see more details in the description text and comments of each file. [A guide on how to use the code to reproduce each study in the paper] 1. Full codes for replicating Illustrative simulation study.txt -- [see Table 2 and Figure 2 in main text]: This is R source code to replicate the illustrative simulation study. Please run from the beginning to the end in R. In addition to estimated coefficients (posterior means of coefficients), indicators of variable selections, and segment memberships, you will get dendrograms of selected groups of variables in Figure 2. Computing time is approximately 20 to 30 minutes 3-a. Preprocessing raw Yelp Reviews for Customer-level Segmentation.txt: Code for preprocessing the downloaded unstructured Yelp review data and preparing DV and IVs matrix for customer-level segmentation study. 3-b. Instruction for replicating Customer-level Segmentation analysis.txt -- [see Table 10 in main text; Tables F-1, F-2, and F-3 and Figure F-1 in Web Appendix]: Code for replicating customer-level segmentation study with Yelp data. You will get estimated coefficients (posterior means of coefficients), indicators of variable selections, and segment memberships. Computing time is approximately 3 to 4 hours. 4-a. Preprocessing raw Yelp reviews_Restaruant Segmentation (1).txt: R code for preprocessing the downloaded unstructured Yelp data and preparing DV and IVs matrix for restaurant-level segmentation study. 4-b. Instructions for replicating restaurant-level segmentation analysis.txt -- [see Tables 5, 6 and 7 in main text; Tables E-4 and E-5 and Figure H-1 in Web Appendix]: Code for replicating restaurant-level segmentation study with Yelp. you will get estimated coefficients (posterior means of coefficients), indicators of variable selections, and segment memberships. Computing time is approximately 10 to 12 hours. [Guidelines for running Benchmark models in Table 6] Unsupervised Topic model: 'topicmodels' package in R -- after determining the number of topics(e.g., with 'ldatuning' R package), run 'LDA' function in the 'topicmodels'package. Then, compute topic probabilities per restaurant (with 'posterior' function in the package) which can be used as predictors. Then, conduct prediction with regression Hierarchical topic model (HDP): 'gensimr' R package -- 'model_hdp' function for identifying topics in the package (see https://radimrehurek.com/gensim/models/hdpmodel.html or https://gensimr.news-r.org/). Supervised topic model: 'lda' R package -- 'slda.em' function for training and 'slda.predict' for prediction. Aggregate regression: 'lm' default function in R. Latent class regression without variable selection: 'flexmix' function in 'flexmix' R package. Run flexmix with a certain number of segments (e.g., 3 segments in this study). Then, with estimated coefficients and memberships, conduct prediction of dependent variable per each segment. Latent class regression with variable selection: 'Unconstraind_Bayes_Mixture' function in Kim, Fong and DeSarbo(2012)'s package. Run the Kim et al's model (2012) with a certain number of segments (e.g., 3 segments in this study). Then, with estimated coefficients and memberships, we can do prediction of dependent variables per each segment. The same R package ('KimFongDeSarbo2012.zip') can be downloaded at: https://sites.google.com/scarletmail.rutgers.edu/r-code-packages/home 5. Instructions for replicating Professor ratings review study.txt -- [see Tables G-1, G-2, G-4 and G-5, and Figures G-1 and H-2 in Web Appendix]: Code to replicate the Professor ratings reviews study. Computing time is approximately 10 hours. [A list of the versions of R, packages, and computer...
Z
Same Sentiment Classification Train/Dev/Test Pair IDs
data.niaid.nih.gov
zenodo.org
Updated Jun 14, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Gerhard Heyer (2022). Same Sentiment Classification Train/Dev/Test Pair IDs [Dataset]. https://data.niaid.nih.gov/resources?id=ZENODO_5495792
Explore at:
Dataset updated
Jun 14, 2022
Dataset provided by
Erik Körner
Ahmad Dawar Hakimi
Gerhard Heyer
Martin Potthast
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This "dataset" only includes the compiled pairings of the Yelp Business Review Dataset. To get access to the actual review texts, please follow the instructions on the Yelp Dataset webpage.

The data format is JSONlines. Python Load Example:

import pandas as pd traindev_df = pd.read_json("df_traindev.jsonl", lines=True) test_df = pd.read_json("df_test.jsonl", lines=True)

example access to single business/review id

s1_bid = test_df.iloc[0]["sent1_business_id"] s1_rid = test_df.iloc[0]["sent1_review_id"] s2_bid = test_df.iloc[0]["sent2_business_id"] s2_rid = test_df.iloc[0]["sent2_review_id"] label = test_df.iloc[0]["is_same_side"]

See documentation at:

Yelp Dataset Schemata (only business.json and review.json were used)

Yelp Business Category Hierarchy (download the json file as all_category_list.json)

For details on how the data was compiled and used in our experiments, please refer to our code repository. Other derived data splits can be reproduced deterministically by using the same random seed as in our experiments.
ScrapeHero Data Cloud - Free and Easy to use
datarade.ai
.json, .csv
Updated Feb 8, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Scrapehero (2022). ScrapeHero Data Cloud - Free and Easy to use [Dataset]. https://datarade.ai/data-products/scrapehero-data-cloud-free-and-easy-to-use-scrapehero
Explore at:
.json, .csvAvailable download formats
Dataset updated
Feb 8, 2022
Dataset provided by
ScrapeHero
Authors
Scrapehero
Area covered
Bhutan, Ghana, Bahamas, Dominica, Slovakia, Anguilla, Portugal, Niue, Chad, Bahrain
Description
The Easiest Way to Collect Data from the Internet Download anything you see on the internet into spreadsheets within a few clicks using our ready-made web crawlers or a few lines of code using our APIs

We have made it as simple as possible to collect data from websites

Easy to Use Crawlers Amazon Product Details and Pricing Scraper Amazon Product Details and Pricing Scraper Get product information, pricing, FBA, best seller rank, and much more from Amazon.

Google Maps Search Results Google Maps Search Results Get details like place name, phone number, address, website, ratings, and open hours from Google Maps or Google Places search results.

Twitter Scraper Twitter Scraper Get tweets, Twitter handle, content, number of replies, number of retweets, and more. All you need to provide is a URL to a profile, hashtag, or an advance search URL from Twitter.

Amazon Product Reviews and Ratings Amazon Product Reviews and Ratings Get customer reviews for any product on Amazon and get details like product name, brand, reviews and ratings, and more from Amazon.

Google Reviews Scraper Google Reviews Scraper Scrape Google reviews and get details like business or location name, address, review, ratings, and more for business and places.

Walmart Product Details & Pricing Walmart Product Details & Pricing Get the product name, pricing, number of ratings, reviews, product images, URL other product-related data from Walmart.

Amazon Search Results Scraper Amazon Search Results Scraper Get product search rank, pricing, availability, best seller rank, and much more from Amazon.

Amazon Best Sellers Amazon Best Sellers Get the bestseller rank, product name, pricing, number of ratings, rating, product images, and more from any Amazon Bestseller List.

Google Search Scraper Google Search Scraper Scrape Google search results and get details like search rank, paid and organic results, knowledge graph, related search results, and more.

Walmart Product Reviews & Ratings Walmart Product Reviews & Ratings Get customer reviews for any product on Walmart.com and get details like product name, brand, reviews, and ratings.

Scrape Emails and Contact Details Scrape Emails and Contact Details Get emails, addresses, contact numbers, social media links from any website.

Walmart Search Results Scraper Walmart Search Results Scraper Get Product details such as pricing, availability, reviews, ratings, and more from Walmart search results and categories.

Glassdoor Job Listings Glassdoor Job Listings Scrape job details such as job title, salary, job description, location, company name, number of reviews, and ratings from Glassdoor.

Indeed Job Listings Indeed Job Listings Scrape job details such as job title, salary, job description, location, company name, number of reviews, and ratings from Indeed.

LinkedIn Jobs Scraper Premium LinkedIn Jobs Scraper Scrape job listings on LinkedIn and extract job details such as job title, job description, location, company name, number of reviews, and more.

Redfin Scraper Premium Redfin Scraper Scrape real estate listings from Redfin. Extract property details such as address, price, mortgage, redfin estimate, broker name and more.

Yelp Business Details Scraper Yelp Business Details Scraper Scrape business details from Yelp such as phone number, address, website, and more from Yelp search and business details page.

Zillow Scraper Premium Zillow Scraper Scrape real estate listings from Zillow. Extract property details such as address, price, Broker, broker name and more.

Amazon product offers and third party sellers Amazon product offers and third party sellers Get product pricing, delivery details, FBA, seller details, and much more from the Amazon offer listing page.

Realtor Scraper Premium Realtor Scraper Scrape real estate listings from Realtor.com. Extract property details such as Address, Price, Area, Broker and more.

Target Product Details & Pricing Target Product Details & Pricing Get product details from search results and category pages such as pricing, availability, rating, reviews, and 20+ data points from Target.

Trulia Scraper Premium Trulia Scraper Scrape real estate listings from Trulia. Extract property details such as Address, Price, Area, Mortgage and more.

Amazon Customer FAQs Amazon Customer FAQs Get FAQs for any product on Amazon and get details like the question, answer, answered user name, and more.

Yellow Pages Scraper Yellow Pages Scraper Get details like business name, phone number, address, website, ratings, and more from Yellow Pages search results.
Data from: The Multilingual Amazon Reviews Corpus
registry.opendata.aws
Updated May 28, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Amazon (2020). The Multilingual Amazon Reviews Corpus [Dataset]. https://registry.opendata.aws/amazon-reviews-ml/
Explore at:
Dataset updated
May 28, 2020
Dataset provided by
Amazon.comhttp://amazon.com/
Description
We present a collection of Amazon reviews specifically designed to aid research in multilingual text classification. The dataset contains reviews in English, Japanese, German, French, Chinese and Spanish, collected between November 1, 2015 and November 1, 2019. Each record in the dataset contains the review text, the review title, the star rating, an anonymized reviewer ID, an anonymized product ID and the coarse-grained product category (e.g. 'books', 'appliances', etc.)
h
Data from: imdb
huggingface.co
Updated Aug 3, 2003
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Stanford NLP (2003). imdb [Dataset]. https://huggingface.co/datasets/stanfordnlp/imdb
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Aug 3, 2003
Dataset authored and provided by
Stanford NLP
License
https://choosealicense.com/licenses/other/https://choosealicense.com/licenses/other/
Description
Dataset Card for "imdb"

Dataset Summary

Large Movie Review Dataset. This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well.

Supported Tasks and Leaderboards

More Information Needed

Languages

More Information Needed… See the full description on the dataset page: https://huggingface.co/datasets/stanfordnlp/imdb.
Most popular Apple App Store categories 2024
statista.com
Updated Sep 30, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Statista (2024). Most popular Apple App Store categories 2024 [Dataset]. https://www.statista.com/statistics/270291/popular-categories-in-the-app-store/
Explore at:
Dataset updated
Sep 30, 2024
Dataset authored and provided by
Statistahttp://statista.com/
Area covered
Worldwide
Description
As of the second quarter of 2024, gaming apps were the most popular category available in the Apple App Store, making up a share of approximately 12 percent of all iOS apps in hosted in the platform. Business apps were the second-most popular category in the Apple App Store with a share of 10.11 percent of active apps being apps in this category. Utilities apps were the third most popular iOS app category, accounting for a total of over 9.73 percent of active apps. iOS apps in the Apple App Store As nowadays it is inconceivable to imagine smartphones without the abundance of mobile applications, the number of available iOS mobile apps distributed via the Apple App Store has consistently grown, reaching the peak of 2.2 million available apps during the first quarter of 2021. As of the beginning of 2022, the highest-ranking mobile apps on the results page of the Apple App Store had an average of 31.7 thousand reviews, and received an average rating of 3.35 stars. Mobile gaming on the Apple App Store Gaming apps are the most popular apps based on availability and often rank as the most downloaded and highest-grossing apps. In 2024, The Honor of Kings was the top grossing gaming app, grossing more than nine million U.S. dollars worldwide. The average price for a gaming app is about 47 cents, while the average app price is just under one U.S. dollar. Common app revenue models include free apps with in-app purchases, paid apps without in-app purchases, and paid apps with in-app purchases. Gaming apps, which frequently generate revenue through in-game item purchases, were the top-grossing category on the Apple App Store, generating almost 52 billion U.S. dollars from global users in 2021.
Not seeing a result you expected?
Learn how you can add new datasets to our index.

Facebook

Twitter

Click to copy link

Link copied

Cite

Kim, Sunghoon; Lee, Sanghak; McCulloch, Robert (2024). Replication Data for: \"A Topic-based Segmentation Model for Identifying Segment-Level Drivers of Star Ratings from Unstructured Text Reviews\" [Dataset]. http://doi.org/10.7910/DVN/EE3DE2

Replication Data for: \"A Topic-based Segmentation Model for Identifying Segment-Level Drivers of Star Ratings from Unstructured Text Reviews\"

Explore at:

Unique identifier

https://doi.org/10.7910/DVN/EE3DE2

Dataset updated

Sep 25, 2024

Dataset provided by

Harvard Dataverse

Authors

Kim, Sunghoon; Lee, Sanghak; McCulloch, Robert

Description

We provide instructions, codes and datasets for replicating the article by Kim, Lee and McCulloch (2024), "A Topic-based Segmentation Model for Identifying Segment-Level Drivers of Star Ratings from Unstructured Text Reviews." This repository provides a user-friendly R package for any researchers or practitioners to apply A Topic-based Segmentation Model with Unstructured Texts (latent class regression with group variable selection) to their datasets. First, we provide a R code to replicate the illustrative simulation study: see file 1. Second, we provide the user-friendly R package with a very simple example code to help apply the model to real-world datasets: see file 2, Package_MixtureRegression_GroupVariableSelection.R and Dendrogram.R. Third, we provide a set of codes and instructions to replicate the empirical studies of customer-level segmentation and restaurant-level segmentation with Yelp reviews data: see files 3-a, 3-b, 4-a, 4-b. Note, due to the dataset terms of use by Yelp and the restriction of data size, we provide the link to download the same Yelp datasets (https://www.kaggle.com/datasets/yelp-dataset/yelp-dataset/versions/6). Fourth, we provided a set of codes and datasets to replicate the empirical study with professor ratings reviews data: see file 5. Please see more details in the description text and comments of each file. [A guide on how to use the code to reproduce each study in the paper] 1. Full codes for replicating Illustrative simulation study.txt -- [see Table 2 and Figure 2 in main text]: This is R source code to replicate the illustrative simulation study. Please run from the beginning to the end in R. In addition to estimated coefficients (posterior means of coefficients), indicators of variable selections, and segment memberships, you will get dendrograms of selected groups of variables in Figure 2. Computing time is approximately 20 to 30 minutes 3-a. Preprocessing raw Yelp Reviews for Customer-level Segmentation.txt: Code for preprocessing the downloaded unstructured Yelp review data and preparing DV and IVs matrix for customer-level segmentation study. 3-b. Instruction for replicating Customer-level Segmentation analysis.txt -- [see Table 10 in main text; Tables F-1, F-2, and F-3 and Figure F-1 in Web Appendix]: Code for replicating customer-level segmentation study with Yelp data. You will get estimated coefficients (posterior means of coefficients), indicators of variable selections, and segment memberships. Computing time is approximately 3 to 4 hours. 4-a. Preprocessing raw Yelp reviews_Restaruant Segmentation (1).txt: R code for preprocessing the downloaded unstructured Yelp data and preparing DV and IVs matrix for restaurant-level segmentation study. 4-b. Instructions for replicating restaurant-level segmentation analysis.txt -- [see Tables 5, 6 and 7 in main text; Tables E-4 and E-5 and Figure H-1 in Web Appendix]: Code for replicating restaurant-level segmentation study with Yelp. you will get estimated coefficients (posterior means of coefficients), indicators of variable selections, and segment memberships. Computing time is approximately 10 to 12 hours. [Guidelines for running Benchmark models in Table 6] Unsupervised Topic model: 'topicmodels' package in R -- after determining the number of topics(e.g., with 'ldatuning' R package), run 'LDA' function in the 'topicmodels'package. Then, compute topic probabilities per restaurant (with 'posterior' function in the package) which can be used as predictors. Then, conduct prediction with regression Hierarchical topic model (HDP): 'gensimr' R package -- 'model_hdp' function for identifying topics in the package (see https://radimrehurek.com/gensim/models/hdpmodel.html or https://gensimr.news-r.org/). Supervised topic model: 'lda' R package -- 'slda.em' function for training and 'slda.predict' for prediction. Aggregate regression: 'lm' default function in R. Latent class regression without variable selection: 'flexmix' function in 'flexmix' R package. Run flexmix with a certain number of segments (e.g., 3 segments in this study). Then, with estimated coefficients and memberships, conduct prediction of dependent variable per each segment. Latent class regression with variable selection: 'Unconstraind_Bayes_Mixture' function in Kim, Fong and DeSarbo(2012)'s package. Run the Kim et al's model (2012) with a certain number of segments (e.g., 3 segments in this study). Then, with estimated coefficients and memberships, we can do prediction of dependent variables per each segment. The same R package ('KimFongDeSarbo2012.zip') can be downloaded at: https://sites.google.com/scarletmail.rutgers.edu/r-code-packages/home 5. Instructions for replicating Professor ratings review study.txt -- [see Tables G-1, G-2, G-4 and G-5, and Figures G-1 and H-2 in Web Appendix]: Code to replicate the Professor ratings reviews study. Computing time is approximately 10 hours. [A list of the versions of R, packages, and computer...

Clear search

Close search

Google apps

Main menu

Replication Data for: \"A Topic-based Segmentation Model for Identifying...

Same Sentiment Classification Train/Dev/Test Pair IDs

example access to single business/review id

ScrapeHero Data Cloud - Free and Easy to use

Data from: The Multilingual Amazon Reviews Corpus

Data from: imdb

Most popular Apple App Store categories 2024

Replication Data for: \"A Topic-based Segmentation Model for Identifying Segment-Level Drivers of Star Ratings from Unstructured Text Reviews\"