This fifth and last Lab Assignment will give you practice in text embeddings and recommendation systems.
In this lab you will build a very simple content-based book recommendation system powered by text embeddings. Along the way you will explore how raw text is transformed into dense vectors, measure semantic similarity, and see how the same embeddings that drive NLP tasks can be repurposed directly as features in a recommender system.
Text Embeddings
As we have learned, an embedding is a function that maps a discrete object (a word, sentence, or document) to a point in a continuous vector space ℝⁿ. Good embeddings are arranged so that semantically similar objects land close together. For example, the vectors for “king” and “queen” are closer to each other than either is to “car”.
Modern sentence embeddings (e.g. from models like Sentence-BERT) encode an entire piece of text into a single fixed-length vector by passing it through a pre-trained transformer and pooling the token representations. This gives us a compact, comparable representation of arbitrary text.
Cosine Similarity
Given two vectors u and v, cosine similarity measures the cosine of the angle between them: cos(u, v) = (u · v) / (‖u‖ ‖v‖). A value of 1 means identical direction (most similar); 0 means orthogonal (unrelated); −1 means opposite. For sentence embeddings, scores in practice typically fall in [0, 1], since the vectors rarely point in opposing directions.
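As a quick illustration of the formula (a sketch using NumPy with made-up 3-dimensional vectors, not real embeddings), cosine similarity is just the dot product divided by the product of the norms:

```python
import numpy as np

def cosine_sim(u, v):
    """Cosine of the angle between vectors u and v."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

u = np.array([1.0, 2.0, 3.0])
v = np.array([2.0, 4.0, 6.0])   # same direction as u, different magnitude
w = np.array([-1.0, 0.0, 0.5])

print(cosine_sim(u, v))  # 1.0 — identical direction
print(cosine_sim(u, w))  # a value strictly between -1 and 1
```

Note that scaling a vector does not change its cosine similarity to anything, which is why magnitude-heavy documents don't dominate the way they can with raw dot products.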
Content-Based Filtering
A content-based recommender recommends items that are similar in content to items the user has already liked. When items are described by text (titles, descriptions, reviews), embeddings give us an automatic way to define “similar in content” without any manual feature engineering.
All work for this lab must be done in a single Jupyter notebook. Items identified below with bullets require answers in markdown cells, like this:
Put your name in a markdown cell at the top of your notebook.
Ensure your notebook has access to the following packages:
pip install sentence-transformers scikit-learn pandas numpy
Dataset
You will work with a small, hand-crafted dataset of 12 books. Each book has a title, genre, and a one-paragraph description. Copy the following into your notebook:
books = [
    {"id": 1, "title": "Dune", "genre": "Sci-Fi", "description": "A desert planet holds the most precious resource in the universe. A noble family is drawn into a deadly political and religious conflict over its control."},
    {"id": 2, "title": "Foundation", "genre": "Sci-Fi", "description": "A mathematician predicts the fall of civilization and secretly plans to shorten the coming dark age through a long-term scientific project."},
    {"id": 3, "title": "Neuromancer", "genre": "Sci-Fi", "description": "A washed-up hacker is hired for one last heist in a neon-lit future of corporate espionage, artificial intelligence, and cyberspace."},
    {"id": 4, "title": "The Name of the Wind", "genre": "Fantasy", "description": "A legendary wizard recounts his extraordinary life, from a childhood among traveling performers to his years at a magical university."},
    {"id": 5, "title": "The Way of Kings", "genre": "Fantasy", "description": "On a world ravaged by storms, a slave, a scholar, and a reluctant warrior are each drawn into an ancient conflict that will decide humanity's fate."},
    {"id": 6, "title": "The Hobbit", "genre": "Fantasy", "description": "A reluctant homebody is swept into a grand adventure with a company of dwarves seeking to reclaim their mountain home from a fearsome dragon."},
    {"id": 7, "title": "The Martian", "genre": "Sci-Fi", "description": "An astronaut is accidentally stranded on Mars and must use ingenuity and dark humor to survive until a rescue mission can reach him."},
    {"id": 8, "title": "Ender's Game", "genre": "Sci-Fi", "description": "A child prodigy is trained at a remote battle school to become the commander Earth needs to defeat an alien invasion."},
    {"id": 9, "title": "A Wizard of Earthsea", "genre": "Fantasy", "description": "A young boy with raw magical talent enrolls in a school for wizards and must hunt down the shadow creature he accidentally unleashed."},
    {"id": 10, "title": "Recursion", "genre": "Thriller", "description": "A neuroscientist and a police detective uncover a memory-altering technology that is quietly rewriting people's pasts and destabilizing reality."},
    {"id": 11, "title": "Gone Girl", "genre": "Thriller", "description": "When a woman vanishes on her anniversary, her charming husband becomes the prime suspect, but both have been keeping dangerous secrets."},
    {"id": 12, "title": "The Girl with the Dragon Tattoo", "genre": "Thriller", "description": "A disgraced journalist and a brilliant hacker investigate a decades-old disappearance within a wealthy and deeply dysfunctional Swedish family."},
]
How many books belong to each genre? Use pandas to count. Pick two books whose descriptions you would intuitively expect to be similar, and two you would expect to be very different.
Write down your predictions in a markdown cell.
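A minimal sketch of the genre count with pandas. For brevity only the genre column is reproduced here (in catalog order); in your notebook, build the DataFrame from the full books list with pd.DataFrame(books):

```python
import pandas as pd

# Genre column only, in the same order as the 12-book catalog above.
df = pd.DataFrame({"genre": [
    "Sci-Fi", "Sci-Fi", "Sci-Fi",
    "Fantasy", "Fantasy", "Fantasy",
    "Sci-Fi", "Sci-Fi", "Fantasy",
    "Thriller", "Thriller", "Thriller",
]})

counts = df["genre"].value_counts()
print(counts)
# Sci-Fi      5
# Fantasy     4
# Thriller    3
```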
Load a pre-trained sentence embedding model and embed all 12 descriptions:
from sentence_transformers import SentenceTransformer
import pandas as pd

df = pd.DataFrame(books)
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(df['description'].tolist())
print('Embedding shape:', embeddings.shape)
Answer the following questions in a markdown cell:
Research the model used. How was it created, and what is it good at?
What is the shape of the embeddings array?
What does each dimension represent?
The model all-MiniLM-L6-v2 is a Sentence-BERT variant. What does "sentence-level" embedding mean, as opposed to "word-level" embedding?
Do some research on other sentence embedding models and select one that your research suggests will also work well for our use case. Implement it in your code.
Describe why you chose the other model.
Save the embeddings from both models to variables for use in later tasks.
Use an LLM to help you write code that computes the full pairwise cosine similarity matrix for this small dataset, rounded to 3 decimal places. Add this code to your notebook as a function and test it. Verify that the function works correctly!
In a markdown cell identify the LLM you used and what your prompt was.
Find the pair of books with the highest similarity score (excluding each book paired with itself), using both of your models. For each model:
Do these books make intuitive sense as similar? Why or why not?
Find the pair with the lowest similarity score.
Are you surprised?
Revisit the predictions you made in Task 1.
Were you correct?
Where did your intuition diverge from the models' scores?
What is always true about the diagonal of the similarity matrix, and why? (If you don't know the answer, you may use an LLM. Again, note the LLM you used and the prompt you submitted, and be sure to verify the answer provided!)
Write a function that, given a book title, returns the top-k most similar books, using the function from step 3.
Call your function with at least three different query books, including at least one from each genre.
Are the recommendations always from the same genre as the query? Explain why or why not.
Add a parameter to your function that filters recommendations to only include books from a specified genre. Show an example call.
How would you modify this system to recommend based on a user’s entire reading history (multiple liked books) rather than a single query?
Try the recommender using both of your models.
Are there differences? If so, comment on why this might be the case.
The system doesn’t have to be limited to pre-existing descriptions. You can embed arbitrary text and query the catalog with it.
List three of your favourite books (not already on the list) and use an online resource such as Goodreads or Amazon, combined with an LLM, to curate a short, original description (2–3 sentences) for each of your three books. Show me your prompt and the results you got.
Embed your descriptions and compute their cosine similarity to all 12 catalog books, again using both of your models.
Examine the results and comment on whether they feel right to you.
Answer the following in a few sentences each:
What information is captured by a sentence embedding that a simple bag-of-words TF-IDF vector would miss?
This system is content-based. What data would you need to build a collaborative filtering system instead, and what are the trade-offs between the two approaches?
The model we used was pre-trained on general text. How might recommendations change if you fine-tuned it on a large corpus of book reviews?
Name one bias that could be introduced by using description-based embeddings, and suggest how you might mitigate it.
With millions of items, computing all pairwise cosine similarities becomes infeasible. Name one technique that addresses this scalability problem.
In addition to the 3 points which this lab is worth, I will award up to 5 participation points for submissions which exceed expectations.
To get (some) of these points, build on the lab by adding better/more data, implementing another recommendation approach or system, or by using different modeling. It is up to you.
The more innovative, the higher the potential for more points!
Once done, download your Jupyter Notebook with all your work and results from above, and send it to me as an attachment to an email with the subject line: CS 4403 - Lab Assignment 5.
Submit this no later than 23:59 on Thursday 26 March, 2026.
Send me your Jupyter Notebook containing your code, output, and answers. Be sure to include descriptive comments in your code where appropriate.