This fourth lab assignment gives you practice in text mining using an embeddings-based approach.
The OpinRank dataset contains a large number of user reviews for cars and hotels. I have extracted the reviews for a subset of hotels, done some minimal cleanup of the text, and put the result into a CSV file. You can download it here.
Load the dataset into a dataframe and extract the details column.
Then write code to do the following steps:
Convert text to lowercase.
Remove punctuation, stop words, words shorter than 2 characters, and non-alphabetic characters (including numbers).
Tokenize the text.
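The cleaning steps above can be sketched as a single function. This is a minimal sketch under stated assumptions, not a reference solution: the function name `preprocess` and the tiny `STOP_WORDS` set are hypothetical, and a real run would likely use a fuller stop-word list (for example, one shipped with NLTK or scikit-learn).

```python
import re

# Assumption: a tiny illustrative stop-word list, far smaller than a real one.
STOP_WORDS = {"a", "an", "and", "the", "is", "was", "were", "in", "of", "to", "it", "at"}


def preprocess(text):
    """Lowercase, strip non-letters, tokenize, then drop stop words and short tokens."""
    text = text.lower()
    # Replace every run of characters outside a-z with a space.
    # This removes punctuation, digits, and other non-alphabetic characters.
    text = re.sub(r"[^a-z]+", " ", text)
    # Whitespace tokenization is enough once only letters and spaces remain.
    tokens = text.split()
    # Keep tokens of at least 2 characters that are not stop words.
    return [t for t in tokens if len(t) >= 2 and t not in STOP_WORDS]


print(preprocess("The room was CLEAN, and the staff were friendly! 10/10 :)"))
# ['room', 'clean', 'staff', 'friendly']
```

Applied to each entry of the details column, this yields one list of tokens per review, which is the input shape Word2Vec expects. Note that contractions such as "didn't" are split at the apostrophe, and the leftover single-letter piece is dropped by the length filter.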
Train a Word2Vec model with the following suggested parameters (feel free to experiment):
Vector size: 150
Window size: 10
Minimum word count: 5
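One possible way to wire the suggested parameters into gensim is sketched below. It assumes gensim 4.x, where the vector size argument is called `vector_size` (older 3.x releases used `size`). The names `W2V_PARAMS` and `train_word2vec`, and the `workers` value, are illustrative choices rather than part of the assignment.

```python
# Suggested starting parameters from the assignment, using gensim 4.x argument names.
W2V_PARAMS = {"vector_size": 150, "window": 10, "min_count": 5}


def train_word2vec(tokenized_reviews, **overrides):
    """Train a Word2Vec model on a list of token lists (one list per review)."""
    # Imported here so gensim is only required when training actually runs.
    from gensim.models import Word2Vec

    # Assumption: 4 worker threads; any keyword passed in `overrides` wins.
    params = {**W2V_PARAMS, "workers": 4, **overrides}
    return Word2Vec(sentences=tokenized_reviews, **params)
```

A call such as `model = train_word2vec(tokenized_reviews)` uses the suggested values, while `train_word2vec(tokenized_reviews, window=5)` makes it easy to experiment with one parameter at a time.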
Explore the model and find the 10 most frequent words in the dataset. Display these words.
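As a cross-check on whatever the model reports, the most frequent words can also be counted directly from the tokenized reviews. The sketch below uses made-up toy data and a hypothetical helper name, `most_frequent_words`. In gensim 4.x, `model.wv.index_to_key` is ordered from most to least frequent by default, so its first ten entries should give a comparable answer from the model itself.

```python
from collections import Counter

# Toy stand-in for the tokenized reviews; the real data comes from the CSV file.
tokenized_reviews = [
    ["room", "clean", "staff", "friendly"],
    ["staff", "helpful", "room", "small"],
    ["location", "great", "staff", "friendly"],
]


def most_frequent_words(tokenized_docs, n=10):
    """Return the n most common tokens as (word, count) pairs."""
    counts = Counter(token for doc in tokenized_docs for token in doc)
    return counts.most_common(n)


for word, count in most_frequent_words(tokenized_reviews, n=3):
    print(word, count)
```

Words that occur fewer times than the minimum word count never enter the model's vocabulary, so the two approaches can disagree for rare words but should agree on the top of the list.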
Write a function that takes a word as input and returns the top 5 most similar words based on cosine similarity. For example:
words_similar_to('clean')
words_similar_to('service')
Test your function with at least 5 different words.
(You will find https://tedboy.github.io/nlps/generated/generated/gensim.models.Word2Vec.most_similar.html useful for this step.)
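To make the ranking concrete, here is a from-scratch sketch of what a "most similar by cosine similarity" lookup computes. The three-dimensional vectors are invented for illustration, and unlike the assignment's `words_similar_to(word)`, this version takes the vector table as an explicit second argument. With a trained gensim model, the same ranking would normally come from `model.wv.most_similar(positive=[word], topn=5)` instead.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def words_similar_to(word, vectors, topn=5):
    """Rank every other word in `vectors` by cosine similarity to `word`."""
    target = vectors[word]
    scores = [
        (other, cosine_similarity(target, vec))
        for other, vec in vectors.items()
        if other != word  # a word is trivially most similar to itself
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:topn]


# Made-up toy vectors; real Word2Vec vectors would have 150 dimensions here.
toy_vectors = {
    "clean": [0.9, 0.1, 0.0],
    "spotless": [0.8, 0.2, 0.1],
    "dirty": [-0.7, 0.1, 0.2],
    "service": [0.1, 0.9, 0.3],
    "staff": [0.2, 0.8, 0.4],
}

for other, score in words_similar_to("clean", toy_vectors, topn=2):
    print(other, round(score, 3))
```

In this toy table "spotless" ranks first for "clean" because the two vectors point in nearly the same direction, while "dirty" points the opposite way and ranks last.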
Create a function that takes in three words (two negative, and one positive), and returns the 'vector arithmetic' answer (hint: use the above link again).
Using some prevalent words in the dataset, think up three interesting conceptual questions to explore and use your function to test them.
An example is (+ staff + friendly - rude).
UPDATE: It has rightfully been drawn to my attention that my example does not match my instructions. You are free to either use two positives and one negative or vice versa. Apologies.
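A thin wrapper along these lines is one way to structure the vector arithmetic function. It assumes `keyed_vectors` is a gensim `KeyedVectors` object such as `model.wv`, and relies on `most_similar` accepting `positive` and `negative` word lists. The name `word_arithmetic` and the three-word check are illustrative.

```python
def word_arithmetic(keyed_vectors, positive, negative, topn=5):
    """Return the words closest to (sum of positive words) - (sum of negative words).

    Assumption: `keyed_vectors` behaves like gensim's KeyedVectors (e.g. model.wv).
    """
    # The assignment asks for three words in total, in either 2+1 or 1+2 form.
    if len(positive) + len(negative) != 3:
        raise ValueError("expected three words in total, e.g. two positive and one negative")
    return keyed_vectors.most_similar(positive=positive, negative=negative, topn=topn)
```

The example from the text would then read `word_arithmetic(model.wv, positive=["staff", "friendly"], negative=["rude"])`. Words that fell below the minimum word count are not in the vocabulary, so it is worth checking that all three words are present before calling the function.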
Send me your Jupyter Notebook containing your code and output. Be sure to include descriptive commentary where appropriate.
Data cleaning and preprocessing of text data - 1 point.
Correct implementation of Word2Vec - 1 point.
Depth of exploration in similar words and word relationships - 2 points.
Clarity of code and explanations - 1 point.