New Dataset - Review domain #74

Open
soroush-ziaeinejad opened this issue May 4, 2023 · 5 comments

Comments

@soroush-ziaeinejad
Contributor

This issue is for discussing the addition of a new dataset. After searching online, and specifically on this website, I found the Amazon Reviews dataset to be a better fit than the other available datasets, for these reasons:
1- It covers many different Amazon product categories (good for our topic modeling step)
2- Amazon is one of the most widely used review platforms, and the dataset is well known and trustworthy
3- It is relatively recent: reviews are collected up to 2018
4- It spans reviews from 1996 to 2018, which lets us add more temporal contributions
5- It has a version called 5-core, a subset of the data in which every user and item has at least 5 reviews, to avoid sparsity
6- It also includes metadata for all items in the reviews
7- The keys of each record include, but are not limited to: reviewerID, reviewText, summary, and reviewTime
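
As a rough illustration of how the record keys above could be read, here is a minimal sketch. It assumes the dataset is stored as one JSON object per line and that reviewTime looks like "01 5, 2018"; both are assumptions about the file layout, and `parse_review` and the sample record are hypothetical, not part of SEERa.

```python
import json
from datetime import datetime


def parse_review(line):
    """Parse one JSON line and keep only the keys listed above."""
    record = json.loads(line)
    return {
        "reviewerID": record["reviewerID"],
        "reviewText": record.get("reviewText", ""),
        "summary": record.get("summary", ""),
        # Assumed date format "MM D, YYYY"; adjust if the file differs.
        "reviewTime": datetime.strptime(record["reviewTime"], "%m %d, %Y").date(),
    }


# Hypothetical sample record, made up for illustration only.
sample = (
    '{"reviewerID": "U1", "reviewText": "Soft and comfy.", '
    '"summary": "Great", "reviewTime": "01 5, 2018"}'
)
review = parse_review(sample)
```

Converting reviewTime to a date early makes it easier to split reviews into time intervals later.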

Other possible datasets are listed below:

@hosseinfani
Member

@soroush-ziaeinejad
Also explain how SEERa would be applied and how the output would be interpreted, based on reviews.

@soroush-ziaeinejad
Contributor Author

Applying SEERa to this dataset would predict future user communities based on the aspects users care about. For instance, it is possible to predict that a user will be interested in softness (based on her reviews of bedroom sets), simplicity of assembly (based on her reviews of furniture), and packaging. With this information, reviews can be re-ranked for a specific user, bringing up reviews from users in the same community. Additionally, extracting more information, such as future communities based on users' items of interest (their purchases) or the sentiment of their reviews, and adding signals such as their ratings of different items, can improve the performance of downstream tasks such as item recommendation or review re-ranking.

@soroush-ziaeinejad
Contributor Author

@hosseinfani

The Amazon Reviews dataset is integrated into SEERa, functioning up to the Community Prediction Layer. However, I require your assistance with the evaluation phase.

For the Twitter dataset, we compared the news articles mentioned by users with the recommended news articles. Nevertheless, such data is unavailable in this dataset. I propose two solutions:
1- We can recommend users' reviews to their community and check whether other community members vote for (like) those reviews. This approach is straightforward and logical, but our current dataset lacks information on users' votes.
2- We can examine users' product ratings within their community. However, our communities do not capture whether a user's review of a product is positive or negative (which the rating of that product reveals), so the results may be inconclusive. Consider this example:
User1: "Product1 is awesome, with different colors. I highly recommend it." Rating: 5
User2: "Product1 is terrible. You cannot find different colors. I don't recommend it." Rating: 1
These two users will likely be placed in the same community, so comparing their ratings would not make sense.
Moreover, for learning rating patterns, several methods exist that make more sense than reading reviews. However, by analyzing reviews, we can extract the patterns that influence user ratings and predict future ratings.
Additionally, in the given example, both users care about color, so they may post a review whenever color is involved. The question is: would User1 still post a review if the colors were limited? And would User2 if there were many colors? I can argue both yes and no:
YES: They care about color, so they express their opinions on colors through their reviews.
NO: User1 anticipated limited colors and would have been content with just one color for the product. However, the availability of color options excited her, prompting her to post a review. User2 expected different colors, and discovering only one color infuriated him, leading to a negative review.
CONCLUSION: It is not solely about their preferences; it is about their expectations. If something significantly deviates from their expectations, they will likely post a review, and the tone of their review is determined by the alignment between their expectations and reality. However, we have users like local guides who review nearly everything they purchase. In such cases, we can assume that they base their reviews on their areas of interest rather than expectations.

I suggest altering the definition of topics for this dataset. We can define a topic as:
1- Users' interests in various aspects. In this scenario, the topics would be:
Z1: Soft, comfy, relaxing
Z2: Stable, firm, reliable
Z3: Color, classy, modern
Z4: Easy, assemble, tutorial
Z5: Affordable, budget-friendly, value
In this case, a user's topic vector describes the topics that the user cares about. So, we can expect a user to post a review on items whose existing reviews discuss these points of interest. The problem is sparsity, which can be addressed by evaluating only the items that the user has rated so far.
2- We can run the pipeline on each category (let's say 8 categories) and identify different topics (let's say 5 topics) for each category. Then, at T+1, we can analyze the reviews of all items in that category. If the future reviews align with a community's topics of interest, we can anticipate that users in that community will post reviews. If they do, it is a hit; otherwise, a miss. For instance, for the Video Games category, the topics could be:
Z1: Huge, download, internet
Z2: Fun, music, relaxing
Z3: Competitive, shooter, multiplayer
Z4: Challenging, puzzle, boring
Z5: Sports, FIFA, streaming

Let's assume a user's topic vector is [0.1, 0.04, 0.2, 0.05, 0.83]. The largest weight is on Z5, so we can expect that the user is a fan of sports video games such as FIFA and will be interested in future sports video games. Therefore, we recommend all video games related to Z5, and if the user writes a review for any of them, we count it as a hit.
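
The hit/miss rule in this example could be sketched as follows. This is only an illustration under assumptions: each item is tagged with a single dominant topic index, and the item names, `recommend_for_user`, and `is_hit` are hypothetical helpers rather than existing SEERa functions.

```python
def recommend_for_user(topic_vector, item_topics):
    """Return the items whose topic matches the user's strongest topic."""
    dominant = max(range(len(topic_vector)), key=topic_vector.__getitem__)
    return {item for item, topic in item_topics.items() if topic == dominant}


def is_hit(recommended, reviewed_next_interval):
    """A hit means the user reviewed at least one recommended item at T+1."""
    return bool(recommended & reviewed_next_interval)


user_vector = [0.1, 0.04, 0.2, 0.05, 0.83]
# Hypothetical items; topic index 4 corresponds to Z5 (sports).
item_topics = {"game_a": 4, "game_b": 2, "game_c": 4}

recommended = recommend_for_user(user_vector, item_topics)
hit = is_hit(recommended, {"game_c"})
```

Using only the strongest topic is the simplest choice; a threshold over all topic weights would be an alternative if users often care about several topics at once.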

By employing this approach, we can evaluate the performance of the recommendation system based on the alignment between the users' topics of interest and the topics reflected in the reviews of other items. Hits occur when users from a community post reviews that correspond to the anticipated topics, indicating successful recommendations. On the other hand, misses indicate instances where users did not provide reviews matching the expected topics.

I apologize if you found this comment confusing. I can explain more in our weekly meeting.

@soroush-ziaeinejad
Contributor Author

@hosseinfani
Here are some stats of the Amazon Reviews dataset for a specific category (Musical Instruments) for two months (1-11-2016 till 30-12-2016):
[Figures: PostsPerDay, UniqueUsersPerDay, PostsPerUser]
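
For reference, the three statistics named above could be computed along these lines. This is a sketch that assumes reviews are available as (reviewerID, day) pairs; `review_stats` and the sample pairs are hypothetical and are not the data behind the figures.

```python
from collections import Counter, defaultdict


def review_stats(reviews):
    """Compute posts per day, unique users per day, and posts per user.

    reviews: iterable of (reviewerID, day) pairs.
    """
    posts_per_day = Counter()
    users_per_day = defaultdict(set)
    posts_per_user = Counter()
    for user, day in reviews:
        posts_per_day[day] += 1
        users_per_day[day].add(user)
        posts_per_user[user] += 1
    unique_users_per_day = {day: len(users) for day, users in users_per_day.items()}
    return dict(posts_per_day), unique_users_per_day, dict(posts_per_user)


# Hypothetical sample, made up for illustration only.
sample_reviews = [
    ("U1", "2016-11-01"),
    ("U1", "2016-11-01"),
    ("U2", "2016-11-01"),
    ("U2", "2016-11-02"),
]
posts_per_day, unique_users_per_day, posts_per_user = review_stats(sample_reviews)
```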

@hosseinfani
Member

@soroush-ziaeinejad

  • Yes, we cannot use the ratings, since we don't consider polarity.

  • True, we can treat the topics as aspects (or opinionated aspects) of products.

  • To evaluate the communities at T+1, the evaluation protocol should not depend on the recommendation or community detection method. We need an absolute gold/silver answer. As we discussed, we can recommend products to communities based on their aspects, then see whether the majority of community members have reviews linked to the recommended product. This link is explicit (not inferred).

  • In the future, we can also consider the questions that users have answered. That is, we recommend questions to communities as the experts to answer them. If the majority of a community has answered the recommended questions, it is a hit.

  • About the diagrams, we need to put these distributions next to those of the Twitter domain. Then we can get better insight into the differences or similarities between the domains and claim that our method is domain-agnostic.
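
The majority-based check for a recommended product could look roughly like this. It is a sketch that reads "majority" as strictly more than half of the members, which is an assumption; `community_hit` and the sample data are hypothetical.

```python
def community_hit(members, recommended_product, reviews_by_user):
    """Return True if most members have a review linked to the product.

    reviews_by_user maps a user ID to the set of products that user
    reviewed at T+1 (the explicit link, not an inferred one).
    """
    reviewers = sum(
        1
        for member in members
        if recommended_product in reviews_by_user.get(member, set())
    )
    # Assumption: "majority" means strictly more than half of the members.
    return reviewers > len(members) / 2


# Hypothetical community and reviews, made up for illustration only.
community = ["U1", "U2", "U3"]
reviews_by_user = {"U1": {"P1"}, "U2": {"P1", "P2"}, "U3": {"P3"}}

p1_is_hit = community_hit(community, "P1", reviews_by_user)
p2_is_hit = community_hit(community, "P2", reviews_by_user)
```

The same function would apply to the question-answering variant by mapping each user to the set of questions they answered.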
