MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Context
Travel is a diverse and vibrant industry, and India, with its rich cultural heritage and varied landscapes, offers a myriad of experiences for travelers. The India Travel Recommender System Dataset is designed to facilitate the development of personalized travel recommendation systems. This dataset provides an extensive compilation of travel destinations across India, along with user profiles, reviews, and historical travel data. It's an invaluable resource for anyone looking to create AI-powered travel applications focused on the Indian subcontinent.
Content
The dataset is divided into four primary components:
Destinations: Information about various travel destinations in India, including details like type of destination (beach, mountain, historical site, etc.), popularity, and best time to visit.
Users: Profiles of users including their preferences and demographic information. This dataset has been enriched with gender diversity and includes details on the number of adults and children for travel.
Reviews: User-generated reviews and ratings for the different destinations, offering insights into visitor experiences and satisfaction.
User History: Records of users' past travel experiences, including destinations visited and ratings provided.
Each of these components is presented in a separate CSV file, allowing for easy integration and manipulation in data processing and machine learning workflows.
Acknowledgements
This dataset was generated for educational and research purposes and is intended to be used in hackathons, academic projects, and by AI enthusiasts aiming to enhance the travel experience through technology.
Inspiration
The dataset is perfect for exploring a variety of questions and tasks, such as:
- Building a recommendation engine to suggest travel destinations based on user preferences.
- Analyzing travel trends in India.
- Understanding the relationship between user demographics and travel preferences.
- Sentiment analysis of travel destination reviews.
- Forecasting the popularity of travel destinations based on historical data.
We encourage Kaggle users to explore this dataset to uncover unique insights and develop innovative solutions in the realm of travel technology. Whether you're a data scientist, a student, or a travel tech enthusiast, this dataset offers a wealth of opportunities for exploration and creativity.
This dataset is free to use for non-commercial purposes. For commercial use, please contact the dataset provider. Remember to cite the source when using this dataset in your projects.
CC0: Public Domain - The dataset is in the public domain and can be used without restrictions.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
To read any of the datasets, you can use the following code (the counts shown match the Book-crossing row of the table further down):
>>> import numpy as np
>>> # One embedding row per item
>>> embed_image = np.load('embed_image.npy')
>>> embed_image.shape
(33962, 768)
>>> embed_text = np.load('embed_text.npy')
>>> embed_text.shape
(33962, 768)
>>> import pandas as pd
>>> items = pd.read_csv('items.txt')
>>> m = len(items)
>>> print(f'{m} items in dataset')
33962 items in dataset
>>> users = pd.read_csv('users.txt')
>>> n = len(users)
>>> print(f'{n} users in dataset')
14790 users in dataset
>>> train = pd.read_csv('train.txt')
>>> train
user item
0 13444 23557
1 13444 33739
... ... ...
317109 13506 29993
317110 13506 13931
>>> from scipy.sparse import csr_matrix
>>> train_matrix = csr_matrix((np.ones(len(train)), (train.user, train.item)), shape=(n,m))
This collection contains six datasets. Each one is provided with seven combinations of image and text encoders, so you should see 42 folders.
Each folder is named after the dataset and the encoders used for the visual and textual parts, for example bookcrossing-vit_bert.
The datasets are:
- Clothing, Shoes and Jewelry (Amazon)
- Home and Kitchen (Amazon)
- Musical Instruments (Amazon)
- Movies and TV (Amazon)
- Book-Crossing
- Movielens 25M
And the encoders are:
- CLIP (image and text): *-clip_clip. This is the main one used in the experiments.
- ViT and BERT: *-vit_bert
- CLIP (visual only): *-clip_none
- ViT only: *-vit_none
- BERT only: *-none_bert
- CLIP (text only): *-none_clip
- No textual or visual information: *-none_none
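The folder naming scheme described above can be unpacked programmatically. The following is a minimal sketch, assuming every folder follows the pattern `<dataset>-<image encoder>_<text encoder>` and that `none` marks a missing modality; the helper name `parse_folder_name` is hypothetical and not part of the dataset.

```python
def parse_folder_name(folder):
    """Split '<dataset>-<image encoder>_<text encoder>' into its three parts.

    Hypothetical helper: it assumes the naming pattern described in the text.
    """
    # rsplit keeps any hyphen that might belong to the dataset name itself
    dataset, encoders = folder.rsplit("-", 1)
    image_encoder, text_encoder = encoders.split("_")
    # 'none' is taken to mean that the modality is not used
    image_encoder = None if image_encoder == "none" else image_encoder
    text_encoder = None if text_encoder == "none" else text_encoder
    return dataset, image_encoder, text_encoder


print(parse_folder_name("bookcrossing-vit_bert"))  # ('bookcrossing', 'vit', 'bert')
```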
For each dataset we provide the following files, where M is the number of items, N the number of users, D the dimension of the textual embeddings (e.g. 1024), and E the dimension of the visual embeddings (e.g. 768):
- embed_image.npy: a NumPy array of M×E elements.
- embed_text.npy: a NumPy array of M×D elements.
- items.csv: a CSV with the item ID in the original dataset (such as the Amazon ASIN or the movie ID) and the item number, an integer from 0 to M-1.
- users.csv: a CSV with the user ID in the original dataset (such as the Amazon reviewer ID) and the user number, an integer from 0 to N-1.
- train.txt, validation.txt and test.txt: CSV files with the train, validation and test portions of the reviews. Each row is a positive (user, item) pair, that is, an item the user liked or reviewed positively.
We consider a review "positive" if the rating is four or more (or 8 or more for Book-crossing).
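The thresholding rule can be sketched as follows. This is an illustration on toy data, not the extraction script itself: the column names `user`, `item` and `rating` and the helper `positive_interactions` are assumptions.

```python
import pandas as pd


def positive_interactions(ratings, threshold=4):
    """Keep only (user, item) pairs whose rating reaches the threshold.

    Hypothetical helper; use threshold=8 for a 1-10 scale such as Book-crossing.
    """
    keep = ratings["rating"] >= threshold
    return ratings.loc[keep, ["user", "item"]].reset_index(drop=True)


# Toy ratings on a 1-5 scale
toy = pd.DataFrame({
    "user": [0, 0, 1, 2],
    "item": [5, 7, 5, 9],
    "rating": [5, 3, 4, 2],
})
positives = positive_interactions(toy)
print(positives.values.tolist())  # [[0, 5], [1, 5]]
```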
If an item has no image or no text, the corresponding embedding vector is all zeros.
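Because missing modalities are encoded as all-zero rows, they can be detected with a simple mask. The sketch below uses a toy array in place of `embed_image.npy` or `embed_text.npy`; the helper name `has_embedding` is hypothetical.

```python
import numpy as np


def has_embedding(embeddings):
    """Return a boolean mask that is True where a row is not all zeros."""
    return np.any(embeddings != 0, axis=1)


# Toy stand-in for an (M, E) embedding matrix; the second item has no embedding
toy_embed = np.array([
    [0.1, -0.2, 0.3],
    [0.0, 0.0, 0.0],
    [0.5, 0.0, 0.0],
])
mask = has_embedding(toy_embed)
print(mask.tolist())  # [True, False, True]
```

The same mask can be used to restrict an analysis to items that actually have an image or text.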
Dataset | Users | Items | Ratings | Density |
---|---|---|---|---|
Clothing & Shoes & Jewelry | 23318 | 38493 | 178944 | 0.020% |
Home & Kitchen | 5968 | 57645 | 135839 | 0.040% |
Movies & TV | 21974 | 23958 | 216110 | 0.041% |
Musical Instruments | 14429 | 29040 | 93923 | 0.022% |
Book-crossing | 14790 | 33962 | 519613 | 0.103% |
Movielens 25M | 162541 | 59047 | 25000095 | 0.260% |
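The Density column is consistent with the number of ratings divided by the number of possible user-item pairs. A minimal sketch of that calculation, with a hypothetical helper `density`, using the Book-crossing row as a check:

```python
def density(n_users, n_items, n_ratings):
    """Fraction of the user-item matrix that contains a rating."""
    return n_ratings / (n_users * n_items)


# Book-crossing row of the table
print(f"{density(14790, 33962, 519613):.3%}")  # 0.103%
```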
For the Amazon datasets, only a small fraction of the original data was kept, by restricting reviews to a specific date range.
For the Book-crossing dataset, only items with images were considered.
There are various other minor tweaks to how images and texts were obtained. The repository https://github.com/igui/MultimodalRecomAnalysis contains the notebook and scripts to reproduce the dataset extraction from scratch.
These datasets contain reviews from the Goodreads book review website, and a variety of attributes describing the items. Critically, these datasets have multiple levels of user interaction, ranging from adding to a shelf to rating and reading.
Metadata includes
reviews
add-to-shelf, read, review actions
book attributes: title, isbn
graph of similar books
Basic Statistics:
Items: 1,561,465
Users: 808,749
Interactions: 225,394,930
This dataset contains images (scenes) containing fashion products, which are labeled with bounding boxes and links to the corresponding products.
Metadata includes
product IDs
bounding boxes
Basic Statistics:
Scenes: 47,739
Products: 38,111
Scene-Product Pairs: 93,274
These datasets include ratings as well as social (or trust) relationships between users. Data are from LibraryThing (a book review website) and epinions (general consumer reviews).
Metadata includes
reviews
price paid (epinions)
helpfulness votes (librarything)
flags (librarything)
This dataset was created by Noor Saeed
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
Analysis of ‘Movie Recommender System Dataset’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/gargmanas/movierecommenderdataset on 12 November 2021.
--- Dataset description provided by original source is as follows ---
Build a Movie Recommender System using the dataset available.
--- Original source retains full ownership of the source dataset ---
https://creativecommons.org/publicdomain/zero/1.0/
Recommendation systems are used everywhere nowadays: Netflix, Amazon Prime, YouTube, online shopping sites, and so on. Datasets like this are a great way to start working on a recommendation system. The dataset was created from the official API provided by TMDB.
What's inside is more than just rows and columns. This dataset covers 10,000 popular movies based on TMDB ratings, which makes it a good starting point for recommendation algorithms.
Some of the things you can do with this dataset:
- Predict movie revenue and/or movie success based on a certain metric.
- Explore which movies tend to get higher vote counts and vote averages on TMDB.
- Build content-based and collaborative-filtering recommendation engines.
This dataset was generated from The Movie Database API. This product uses the TMDb API but is not endorsed or certified by TMDb. Their API also provides access to data on many additional movies, actors and actresses, crew members, and TV shows. You can try it for yourself here.
These datasets contain reviews from the Steam video game platform, and information about which games were bundled together.
Metadata includes
reviews
purchases, plays, recommends (likes)
product bundles
pricing information
Basic Statistics:
Reviews: 7,793,069
Users: 2,567,538
Items: 15,474
Bundles: 615
http://www.gnu.org/licenses/lgpl-3.0.html
Take a look at the Recommender System Movies Kernel for the EDA of the dataset.
These datasets contain peer-to-peer trades from various recommendation platforms.
Metadata includes
peer-to-peer trades
have and want lists
image data (tradesy)
https://creativecommons.org/publicdomain/zero/1.0/
These files contain metadata for over 20,000 movies listed in the Full TMDB Dataset. The dataset consists of movies released on or before August 2022 as well as some of the upcoming movies till Dec 2028. Data points include title, release dates, languages, genre, popularity, TMDB vote counts, and vote averages.
The Movie Details have been collected from the TMDB Open API. This product uses the TMDb API but is not endorsed or certified by TMDb. Their API also provides access to data on many additional movies, actors and actresses, crew members, and TV shows. You can try it for yourself here.
This dataset is assembled as part of my Project for Recommender Systems. I wanted to perform an extensive EDA on Movie Data to build various types of Recommender Systems.
These datasets contain attributes about products sold on ModCloth and Amazon which may be sources of bias in recommendations (in particular, attributes about how the products are marketed). Data also includes user/item interactions for recommendation.
Metadata includes
ratings
product images
user identities
item sizes, user genders
Likes and image data from the community art website Behance. This is a small, anonymized version of a larger proprietary dataset.
Metadata includes
appreciates (likes)
timestamps
extracted image features
Basic Statistics:
Users: 63,497
Items: 178,788
Appreciates (likes): 1,000,000
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
This dataset was created by Ritik Kumar
Released under Apache 2.0
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
Four multimedia recommender systems datasets to study popularity bias and fairness:
Last.fm (lfm.zip), based on the LFM-1b dataset of JKU Linz (http://www.cp.jku.at/datasets/LFM-1b/)
MovieLens (ml.zip), based on MovieLens-1M dataset (https://grouplens.org/datasets/movielens/1m/)
BookCrossing (book.zip), based on the BookCrossing dataset of Uni Freiburg (http://www2.informatik.uni-freiburg.de/~cziegler/BX/)
MyAnimeList (anime.zip), based on the MyAnimeList dataset of Kaggle (https://www.kaggle.com/CooperUnion/anime-recommendations-database)
Each dataset consists of user interactions (user_events.txt) and three user groups that differ in their inclination towards popular/mainstream items: LowPop (low_main_users.txt), MedPop (med_main_users.txt), and HighPop (high_main_users.txt).
The format of the three user files is "user,mainstreaminess".
The format of the user-events files is "user,item,preference".
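The two file formats can be read with pandas as sketched below. The example uses small in-memory stand-ins for `user_events.txt` and `low_main_users.txt`, and it assumes the files are comma-separated with no header row; check the actual files before relying on either assumption.

```python
import io

import pandas as pd

# Toy stand-ins for user_events.txt and low_main_users.txt (values are made up)
events_txt = io.StringIO("1,10,5.0\n1,11,3.0\n2,10,4.0\n")
low_pop_txt = io.StringIO("1,0.05\n")

# Assumption: no header row, so column names are supplied explicitly
events = pd.read_csv(events_txt, names=["user", "item", "preference"])
low_pop = pd.read_csv(low_pop_txt, names=["user", "mainstreaminess"])

# Interactions that belong to users in the LowPop group
low_pop_events = events[events["user"].isin(low_pop["user"])]
print(len(low_pop_events))  # 2
```

To work with the real data, replace the `io.StringIO` objects with the file paths inside the extracted archive.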
Example Python code for analyzing the datasets, as well as more information on the user groups, can be found on GitHub (https://github.com/domkowald/FairRecSys) and on arXiv (https://arxiv.org/abs/2203.00376).
These datasets contain 1.48 million question and answer pairs about products from Amazon.
Metadata includes
question and answer text
is the question binary (yes/no), and if so does it have a yes/no answer?
timestamps
product ID (to reference the review dataset)
Basic Statistics:
Questions: 1.48 million
Answers: 4,019,744
Labeled yes/no questions: 309,419
Number of unique products with questions: 191,185
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
This dataset was created by Vinothkumar J
Released under Apache 2.0
We introduce PDMX: a Public Domain MusicXML dataset for symbolic music processing, including over 250k musical scores in MusicXML format. PDMX is the largest publicly available, copyright-free MusicXML dataset in existence. PDMX includes genre, tag, description, and popularity metadata for every file.
This is a collection of recipes paired with variants, e.g. a recipe matched with a vegan version of the same recipe.