New Dataset - Review domain #74
Comments
@soroush-ziaeinejad
Applying SEERa to this dataset would let us predict future user communities based on the aspects users care about. For instance, we could predict that a user will be interested in softness (based on her reviews of bedroom sets), ease of assembly (based on her reviews of furniture), and packaging. With this information, reviews can be re-ranked for a specific user so that reviews written by users in the same community appear first. Additionally, extracting further information, such as future communities based on items of interest (purchases) or the sentiment of the reviews, and adding signals such as the ratings users give to different items, can improve the performance of downstream tasks such as item recommendation or review re-ranking.
The Amazon Reviews dataset is now integrated into SEERa and runs up to the Community Prediction Layer. However, I need your help with the evaluation phase. For the Twitter dataset, we compared the news articles mentioned by users with the recommended news articles, but no such data is available in this dataset.

I propose two solutions: I suggest altering the definition of topics for this dataset. We can define a topic as:

Let's assume a user's topic vector is [0.1, 0.04, 0.2, 0.05, 0.83], where the dominant fifth topic (Z5, weight 0.83) corresponds to sports video games such as FIFA. In this case, we can expect that the user is a fan of sports video games and will be interested in future sports video games. Therefore, we recommend all video games related to Z5, and if the user writes a review for any of them, we count it as a hit.

With this approach, we can evaluate the recommendation system by how well the users' topics of interest align with the topics reflected in the reviews of other items. A hit occurs when a user from a community posts a review that corresponds to the anticipated topic, indicating a successful recommendation. A miss occurs when a user does not post any review matching the expected topic.

I apologize if this comment is confusing. I can explain more in our weekly meeting.
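The hit/miss counting described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not SEERa's actual evaluation code: the function names (`dominant_topic`, `hit_rate`), the input dictionaries, and the choice to label each item with a single topic index are all hypothetical, and the sketch uses each user's own topic vector rather than a community-level one.

```python
from typing import Dict, List, Set


def dominant_topic(topic_vector: List[float]) -> int:
    """Return the 0-based index of the largest weight (Z5 -> index 4)."""
    return max(range(len(topic_vector)), key=topic_vector.__getitem__)


def hit_rate(
    user_topics: Dict[str, List[float]],
    item_topics: Dict[str, int],
    future_reviews: Dict[str, Set[str]],
) -> float:
    """Fraction of users who reviewed at least one recommended item.

    user_topics:    user id -> topic vector (hypothetical input)
    item_topics:    item id -> dominant topic index (hypothetical input)
    future_reviews: user id -> items reviewed in the test period
    """
    hits = 0
    for user, vector in user_topics.items():
        topic = dominant_topic(vector)
        # Recommend every item whose topic matches the user's dominant topic.
        recommended = {item for item, t in item_topics.items() if t == topic}
        # A hit: the user later reviewed at least one recommended item.
        if recommended & future_reviews.get(user, set()):
            hits += 1
    return hits / len(user_topics) if user_topics else 0.0


# Toy data: u1 is dominated by topic index 4 (Z5), u2 by topic index 0.
user_topics = {
    "u1": [0.1, 0.04, 0.2, 0.05, 0.83],
    "u2": [0.7, 0.1, 0.1, 0.05, 0.05],
}
item_topics = {"fifa": 4, "nba": 4, "chess": 0, "puzzle": 1}
future_reviews = {"u1": {"fifa"}, "u2": {"puzzle"}}

print(hit_rate(user_topics, item_topics, future_reviews))
```

In this toy data, u1 is a hit (she reviewed an item tagged with her dominant topic) and u2 is a miss. A real evaluation would also need a temporal split so that only reviews written after the prediction point count as hits.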
@hosseinfani
This issue is for discussing the addition of a new dataset. After searching the internet, and this website in particular, I found the Amazon Reviews dataset to be a better fit than other existing datasets for these reasons:
1- It comes from Amazon and covers different product categories (good for our topic modeling step)
2- Amazon is one of the most widely used review platforms, and the dataset is well known and trustworthy
3- It can be considered a recent dataset, with reviews collected up to 2018
4- It spans reviews from 1996 to 2018, enabling us to add more temporal contributions
5- It has a version called 5-core, a subset of the data in which every user and every item has at least 5 reviews, which avoids sparsity
6- It also includes metadata for all items in the reviews
7- The keys of each record include, but are not limited to: reviewerID, reviewText, summary, and reviewTime
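To make the 5-core idea in point 5 concrete, here is a minimal sketch of iterative k-core filtering over review records. It is an assumption about how such a subset can be derived, not a description of how the published 5-core files were built. The helper name `k_core` is hypothetical, and the item key `asin` is assumed here because it is not among the keys listed above.

```python
from collections import Counter
from typing import Dict, Iterable, List


def k_core(
    records: Iterable[Dict[str, str]],
    k: int = 5,
    user_key: str = "reviewerID",
    item_key: str = "asin",  # assumed item-id key, not listed in this issue
) -> List[Dict[str, str]]:
    """Keep only reviews whose user and item each have at least k reviews.

    Dropping a review can push another user or item below k, so the
    filter is repeated until no more records are removed.
    """
    current = list(records)
    while True:
        user_counts = Counter(r[user_key] for r in current)
        item_counts = Counter(r[item_key] for r in current)
        kept = [
            r for r in current
            if user_counts[r[user_key]] >= k and item_counts[r[item_key]] >= k
        ]
        if len(kept) == len(current):
            return kept
        current = kept


# Toy example with k=2: u3 has a single review, so it is dropped.
reviews = [
    {"reviewerID": "u1", "asin": "i1"},
    {"reviewerID": "u1", "asin": "i2"},
    {"reviewerID": "u2", "asin": "i1"},
    {"reviewerID": "u2", "asin": "i2"},
    {"reviewerID": "u3", "asin": "i1"},
]
print(len(k_core(reviews, k=2)))
```

Using the pre-built 5-core files is simpler when they are available. A filter like this is mainly useful if we want a different threshold or want to apply the same rule to another dataset.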
Other possible datasets are listed below: