This first lab assignment will explore some of the ways of finding similarity in text.
I used an LLM to generate a bunch of sentences of various lengths and in various languages:
Import the German sentences file into a Python Notebook for this assignment. Have a look at the data to see what you are dealing with.
Now construct k-gram representations for each sentence, as follows:
Character-based 2-grams, 3-grams, and 4-grams. (A space counts as a character.)
Single-word tokens.
Report the number of distinct k-grams within the file, for each of these four sentence representations.
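The representations and the distinct counts above can be sketched roughly as follows. This is only one possible approach, not a reference solution: the helper names `char_kgrams` and `word_tokens`, the two example sentences, and the choice to lowercase words and leave punctuation attached are all assumptions made for illustration.

```python
def char_kgrams(sentence: str, k: int) -> set:
    """Return the set of character k-grams; spaces count as characters."""
    return {sentence[i:i + k] for i in range(len(sentence) - k + 1)}


def word_tokens(sentence: str) -> set:
    """Return the set of whitespace-separated words, lowercased (an assumption)."""
    return set(sentence.lower().split())


# Hypothetical stand-in for the imported German sentences.
sentences = ["Der Hund schläft.", "Der Hund läuft."]

representations = {
    "2-gram": [char_kgrams(s, 2) for s in sentences],
    "3-gram": [char_kgrams(s, 3) for s in sentences],
    "4-gram": [char_kgrams(s, 4) for s in sentences],
    "word": [word_tokens(s) for s in sentences],
}

# Distinct k-grams in the file = size of the union over all sentences.
distinct_counts = {
    name: len(set().union(*sets)) for name, sets in representations.items()
}
print(distinct_counts)
```

Whether to strip punctuation or normalize case before shingling is a design decision that changes the counts, so it is worth stating whichever choice you make in your report.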
Compute the Jaccard similarity between all pairs of sentences for each type of k-gram, as well as the single word representation. One way to implement the Jaccard coefficient in Python is:
def jaccard(a: set, b: set):
    return len(a.intersection(b)) / len(a.union(b))
For each shingling approach, report which three sentence pairs are the most similar.
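One way to rank all sentence pairs is sketched below. It repeats a Jaccard helper so the snippet runs on its own, and adds a guard for two empty sets (returning 0.0 in that case is an assumption, not part of the assignment). The function name `top_pairs` and the example token sets are hypothetical.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    union = a | b
    # Assumption: two empty sets are treated as having similarity 0.0.
    return len(a & b) / len(union) if union else 0.0


def top_pairs(shingle_sets: list, n: int = 3) -> list:
    """Return the n highest (score, i, j) triples over all index pairs i < j."""
    scored = [
        (jaccard(shingle_sets[i], shingle_sets[j]), i, j)
        for i, j in combinations(range(len(shingle_sets)), 2)
    ]
    return sorted(scored, reverse=True)[:n]


# Hypothetical word-token sets for four sentences.
example = [{"der", "hund"}, {"der", "hund", "bellt"}, {"die", "katze"}, {"der", "hund"}]
for score, i, j in top_pairs(example):
    print(f"sentences {i} and {j}: {score:.3f}")
```

Running the same function once per representation (2-grams, 3-grams, 4-grams, words) gives the per-approach ranking. Comparing every pair is quadratic in the number of sentences, which should be fine for a small file.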
Import the English, French, and Spanish sentences files into your Python Notebook.
Your goal is to assess which two of these three languages are most similar, based on these very small samples. Examine the datasets and develop a strategy for finding this similarity, using some sort of shingling approach.
Implement and test this strategy. Report the results.
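As a starting point only, here is a minimal sketch of one possible strategy: pool the character 3-grams of every sentence in a language into a single set, then compare the pooled sets pairwise with the Jaccard coefficient. The choice of k = 3, the lowercasing, the helper name `pooled_kgrams`, and the one-sentence samples are all assumptions made for illustration. Other strategies (different k, word shingles, frequency-weighted comparisons) may well be better.

```python
from itertools import combinations


def pooled_kgrams(sentences: list, k: int = 3) -> set:
    """Union of lowercased character k-grams over all sentences of one language."""
    pool = set()
    for s in sentences:
        s = s.lower()
        pool.update(s[i:i + k] for i in range(len(s) - k + 1))
    return pool


def jaccard(a: set, b: set) -> float:
    union = a | b
    # Assumption: two empty sets are treated as having similarity 0.0.
    return len(a & b) / len(union) if union else 0.0


# Hypothetical one-sentence samples; the real files hold many more sentences.
samples = {
    "English": ["The cat sleeps on the sofa."],
    "French": ["Le chat dort sur le canapé."],
    "Spanish": ["El gato duerme en el sofá."],
}
profiles = {lang: pooled_kgrams(sents) for lang, sents in samples.items()}
scores = {
    (a, b): jaccard(profiles[a], profiles[b]) for a, b in combinations(profiles, 2)
}
for pair, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(pair, round(score, 3))
```

With samples this small, the ranking can shift with the choice of k and with the topics the sentences happen to cover, so it is worth testing a few settings before drawing a conclusion.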
Download your Jupyter Notebook and send it to me as an attachment to an email, by the deadline in the Course Schedule.