This first lab assignment will explore some of the ways of finding similarity in text.
Choose, or randomly select, 4 languages from the following list of languages written in the Latin alphabet:
Afrikaans, Albanian, Corsican, Dutch, French, Finnish, Frisian, German, Hungarian, Italian, Spanish, Latin, Norwegian, Swedish, Danish, Romanian, and Esperanto.
Create, find, or have an LLM generate, 15 interesting sentences totalling about 250 to 300 words in the English language. Paste this into your notebook and assign this content to a string variable with an appropriate name.
Use Google Translate to convert the English text from the previous step into four separate translations, one for each language you selected, and store each of these in a separate string variable with an appropriate name.
Tokenize (construct k-gram representations) each of these 5 strings, as follows:
Character-based 2-gram, 3-gram, and 4-gram shingles (treat a space as a character for this task).
Single word-based tokens (you can ignore punctuation for this).
When tokenizing, ensure your token-creating code advances only one character to the right at a time, so that consecutive shingles overlap. The code I showed you in class will do that.
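As a reference point for the tokenizing step above, here is a minimal sketch and not necessarily the code shown in class. The function names `char_shingles` and `word_tokens` are hypothetical. Lowercasing the words and using the `\w+` pattern to drop punctuation are assumptions; note that this pattern also splits on apostrophes, so a word such as "l'homme" becomes two tokens.

```python
import re

def char_shingles(text: str, k: int) -> list[str]:
    """Return every k-character window, sliding one character at a time.

    Spaces are kept, so they count as characters in each shingle.
    """
    return [text[i:i + k] for i in range(len(text) - k + 1)]

def word_tokens(text: str) -> list[str]:
    """Return lowercase word tokens, ignoring punctuation (assumed convention)."""
    return re.findall(r"\w+", text.lower())

sample = "the cat sat"
print(char_shingles(sample, 3))
print(word_tokens("The cat sat, then left."))
```

A text shorter than k characters yields an empty list, which is worth checking before computing any similarity.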
Report the number of distinct (unique) tokens in each of these four tokenized representations (2-gram, 3-gram, 4-gram, and single word) for each of your five texts (hint: use the set function).
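One possible way to collect the distinct counts is sketched below. The helper name `distinct_counts` and the two short placeholder texts are hypothetical; substitute your own string variables. Splitting words on whitespace is a simplifying assumption here and does not strip punctuation.

```python
def distinct_counts(text: str) -> dict[str, int]:
    """Count distinct character k-grams (k = 2, 3, 4) and distinct words."""
    counts = {
        f"{k}-gram": len({text[i:i + k] for i in range(len(text) - k + 1)})
        for k in (2, 3, 4)
    }
    # Assumption: whitespace-separated, lowercased words; punctuation is not removed.
    counts["word"] = len(set(text.lower().split()))
    return counts

# Hypothetical placeholder texts, far shorter than the assignment requires.
texts = {
    "English": "the cat sat on the mat",
    "Dutch": "de kat zat op de mat",
}

for name, text in texts.items():
    print(name, distinct_counts(text))
```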
Compute the Jaccard similarity coefficient for all 10 possible pairs of the five language strings, once for each character k-gram size and once for the single-word representation.
One way to implement the Jaccard coefficient in Python is:
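def jaccard(a: set, b: set) -> float:
    # Size of the intersection divided by the size of the union.
    # Raises ZeroDivisionError if both sets are empty.
    return len(a.intersection(b)) / len(a.union(b))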
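To cover every pair exactly once, `itertools.combinations` from the standard library can enumerate the pairs. The sketch below uses three tiny toy token sets, which are made up for illustration and are not real results; the dictionary name `token_sets` is hypothetical. With five languages the same loop yields the 10 pairs the assignment asks for.

```python
from itertools import combinations

def jaccard(a: set, b: set) -> float:
    """Jaccard coefficient: size of intersection over size of union."""
    return len(a & b) / len(a | b)

# Toy 2-gram sets, invented purely to show the mechanics.
token_sets = {
    "English": {"th", "he", "ca", "at"},
    "Dutch": {"de", "ka", "at"},
    "German": {"di", "ka", "at", "tz"},
}

# One score per unordered pair of languages.
scores = {
    (lang_a, lang_b): jaccard(token_sets[lang_a], token_sets[lang_b])
    for lang_a, lang_b in combinations(token_sets, 2)
}

# The pair with the highest coefficient is the most similar under this tokenization.
best_pair = max(scores, key=scores.get)
print(best_pair, round(scores[best_pair], 3))
```

Repeating this once per tokenization (2-gram, 3-gram, 4-gram, word) gives one most-similar pair per approach.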
For each shingling approach, report which two languages are the most similar.
Report the results in an understandable manner.
Offer some thoughts on why your results are the way they are (to do this well, you may want to explore the genealogy of languages a bit). If your results are not "explainable", offer your thoughts on why that might be, and what you could change in your workflow to remedy that.
Download your Jupyter Notebook and send it to me as an attachment to an email with the subject line: CS 4403 - Lab Assignment 1.
Submit this no later than 23:59 on Friday, January 16, 2026.