Submission of final project proposal for Group U #1

Open
Kimberlyshan opened this issue Mar 2, 2022 · 1 comment

Comments

@Kimberlyshan
Collaborator

@QMSS-G5063-2022/teaching_team

SHA: e4784df

@JonathanReeve

My suggestions for this would be:

  • Narrow your datasets. I'd recommend choosing either tweets or Reddit posts, since each medium has its own linguistic style and user base. Comparing the two would overcomplicate things.
  • Think about how you'll handle language. If you only look at English-language posts, you'll exclude a lot of important opinion written in Russian or Ukrainian, not to mention French, German, Polish, Romanian, and so on. That will color a lot of the sentiment you're analyzing.
  • Sentiment analysis and making word clouds are not the same thing. If you decide to go with a word cloud, think about what it is actually doing: it usually throws out stopwords and shows you a quilt of the remaining words, where size is correlated with frequency. But can you do better? If you work out what sorts of things you're interested in measuring, I think you can improve on this out-of-the-box solution.
  • If you really want to do sentiment analysis, look into some sentiment analysis packages for R, or the nltk.sentiment package in Python. Multilingual sentiment analysis might be a bit more difficult, but you could probably accomplish it to a certain extent with a lexical approach, if you find the right word lists for your target languages.
  • Some questions I might have for your data set would include:
    • How is a Reddit user's sentiment about the war correlated with the other subreddits in which they post? If a user subscribes to lots of right-wing subreddits, for instance, does that make them more or less likely to hold certain opinions?
    • Similarly, what else does a Twitter user post about besides the war? How do their opinions about the war correlate with their opinions on other things entirely?
    • What kinds of expressions are correlated with certain sentiments? For example, if you see the expression "special military operation" (the official Russian phrase for the war), what kinds of sentiments are conveyed?
    • Does other metadata correlate with sentiment? For instance, are anti-Ukrainian tweets mostly posted between 9am and 5pm, Moscow time? (That could suggest that they're tweets paid for by the Russian government.)
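
To make the lexical approach and the Moscow-time question a bit more concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the `LEXICONS` word lists are tiny made-up examples (you'd need vetted lists for each target language), and `lexicon_score` and `in_moscow_office_hours` are hypothetical helper names, not part of nltk or any other package. It also treats Moscow time as a fixed UTC+3 offset, which has held year-round since 2014.

```python
import re
from datetime import datetime, timedelta, timezone

# Hypothetical toy lexicons (word -> polarity). Real projects need vetted
# word lists for each target language; these entries are illustrative only.
LEXICONS = {
    "en": {"good": 1, "peace": 1, "terrible": -1, "war": -1},
    "uk": {"мир": 1, "війна": -1},
}

# Assumption: a fixed UTC+3 offset stands in for Moscow time (no DST since 2014).
MOSCOW = timezone(timedelta(hours=3))


def lexicon_score(text, lang):
    """Mean polarity of the tokens found in the lexicon; 0.0 if none match."""
    lexicon = LEXICONS[lang]
    # \w matches Unicode letters, so this also tokenizes Cyrillic text.
    tokens = re.findall(r"\w+", text.lower())
    hits = [lexicon[t] for t in tokens if t in lexicon]
    return sum(hits) / len(hits) if hits else 0.0


def in_moscow_office_hours(ts):
    """True if a timezone-aware timestamp is in [09:00, 17:00) Moscow time."""
    local = ts.astimezone(MOSCOW)
    return 9 <= local.hour < 17


if __name__ == "__main__":
    posted = datetime(2022, 3, 2, 7, 0, tzinfo=timezone.utc)  # 10:00 in Moscow
    print(lexicon_score("This war is terrible", "en"))
    print(in_moscow_office_hours(posted))
```

This is deliberately naive: it ignores negation ("not good"), sarcasm, and inflected word forms, and the time check doesn't look at weekdays. Still, it shows the shape of the thing: score each post, then group the scores by whatever metadata you care about.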

Let me know if you have any questions regarding NLP-related tasks.
