# COSC 2123/1285 Algorithms and Analysis Assignment 1: Word Completion Assessment Type Pair (Group of 2) Assignment. Submit online via Canvas → As- signments → Assignment 1. Clarifications/updates/FAQ...

COSC 2123/1285 Algorithms and Analysis
Assignment 1: Word Completion
Assessment Type Pair (Group of 2) Assignment. Submit online via Canvas → As-
signments → Assignment 1. Clarifications/updates/FAQ can be
found in Ed Discussion Forum → Assignment 1 General Discus-
sion.
Due Date Week 7, 11:59pm, Friday, April 22, 2022
Marks 30
1 Objectives
There are a number of key objectives for this assignment:
• Understand how a real-world problem can be implemented by different data structures and/o
algorithms.
• Evaluate and contrast the performance of the data structures and/or algorithms with respect to
different usage scenarios and input data.
In this assignment, we focus on the word completion problem.
2 Background
Word/sentence (auto-)completion is a very handy feature of nowadays text editors and email
owsers
(you must have used it in your Outlook). While sentence completion is a much more challenging task
and perhaps requires advanced learning methods, word completion is much easier to do as long as
you have a dictionary available in the memory. In this assignment, we will focus on implementing a
dictionary comprising of words and their frequencies that allows word completion. We will try several
data structures and compare their performances. One of these data structures is the Ternary Search
Tree, which is described below.
Ternary Search Trees
Ternary search trees (TST) is a data structure that allows memory-efficient storage of strings and fast
operations such as spell checking and auto-completion.
Each node of the a TST contains the following fields:
• a lower-case letter from the English alphabet (‘a’ to ‘z’),
• a positive integer indicating the word’s frequency (according to some dataset) if the letter is the
last letter of a word in the dictionary,
• a boolean variable that is True if this letter is the last letter of a word in the dictionary and False
otherwise,
• the left pointer points to the left child-node whose letter is smaller (in alphabetical order, i.e.
‘a’ < ‘b’),
• the right pointer points to the right child-node whose letter is larger,
• the middle pointer points to a node that stores the next letter in the word.
As an example, consider Figure 1.
Figure 1: An example of a ternary search tree storing five words and their frequencies. The boolean
value (T)rue indicates that the letter is the end of a word. In that case, a frequency (an integer) is
shown, e.g., 10 for ‘cut’. Note that a word can be a prefix of another, e.g., ‘cut’ is a prefix of ‘cute’.
Construction
A TST can be built by adding words to the tree one by one, noting that different orders of words can
lead to different trees. One can start from the root node (or create one when adding the first word)
and follow the left, middle, or right pointer depending on whether the cu
ent letter is smaller, equal
to, or larger than the letter of the node (in alphabetical order).
If the new word is a prefix of an existing word in the tree then we can simply change the boolean
field of the node storing the last letter to True. For instance, in the example in Figure 1, if (cut, 10)
is added after (cute: 50), then one can simply change the boolean field of the node containing the
letter ‘t’ to True and set the frequency to 10, which signifies that now (cut, 10) is part of the tree. In
another case, when (cup, 30) is added, then one new node has to be constructed to store ‘p’, which
has a True in the boolean field and 20 as its frequency. In this case, ‘cup’ can reuse the two nodes
2
storing ‘c’ and ‘u’ from the previous words. However, when (app, 20) and (farm, 40) are added, three
and four new nodes have to be created, respectively.
Searching
To search for a word (and to get its frequency) in a TST, we use its letters to navigate the tree. Fo
each letter of the word (from the first to the last), we follow either the left, middle, or right pointe
to the co
esponding child node depending on whether this letter is smaller than, equal to, or large
than the letter stored at the cu
ent node. Once we hit a node storing the same letter, we will follow
the middle pointer and start using the next letter of the word on the middle child node and repeat
the process.
The search fails, that is, the word is not in the tree, if either a) the search algorithm hits a Null
node while there is still at least one letter in the word that hasn’t been matched, or b) it finds that the
searched word is a prefix of an existing word in the tree but the boolean field of the node co
esponding
to the last letter is False.
Deletion
The deletion succeeds if the word is already included in the tree. If the word is a prefix of anothe
word then it can be deleted by simply setting the boolean field of the node storing the last letter to
False. For example, if (cut, 10) is to be deleted from the tree in Figure 1, then we only need to change
the boolean field of the node storing ‘t’ to False. Otherwise, if the word has some unique suffix, then
we need to delete nodes co
esponding to the suffix only. For example, if (cup, 30) is to be removed,
then only the node storing the last letter ‘p’ needs to be deleted from the tree.
Auto-completion
One simple way to achieve auto-completion from a partial word is to return three words in the dic-
tionary (tree) of highest frequencies that have the given partial word as a prefix. For example, in the
tree given in Figure 1,
• the auto-completion list for ‘cu’ is: [(cute, 50), (cup, 30), (cut, 10)],
• the auto-completion list for ‘f’ is: [(farm, 40)].
Suppose we add one more word (curiosity, 60) into this tree, then the auto-completion list of ‘cu’ will
e changed to [[(curiosity, 60), (cute, 50), (cup, 30)]. In this example, although ‘cut’ contains ‘cu’ as
a prefix, it is not in the top three of the most common words having ‘cu’ as a prefix. In general, the
auto-completion list contains either three, two, one, or no words. They must be sorted in a decreasing
frequencies, that is, the first word has the highest frequency and the last has the lowest frequency.
The assignment is
3.1 Task A: Implement the Nearest Neighbour Data Structures and their Oper-
ations (12 marks)
In this task, you will implement a dictionary of English words that allows auto-completion, using three
different data structures: Python’s list (a
ay), Python’s dictionary (hash table), and ternary search
tree. Each implementation should support the following operations:
3
• Build a dictionary from a list of words and frequencies: create a dictionary that stores words
and frequencies taken from a given list (using Add below).
• (A)dd a word and its frequency to the dictionary.
• (S)earch for a word in a dictionary and return its frequency (return 0 if not found).
• (D)elete a word from the dictionary.
• (AC)Auto-complete a given string and return a list of three most frequent words (if any) in the
dictionary that have the string as a prefix.
3.1.1 Implementation Details
List-Based Dictionary In this subtask, you will implement the dictionary using Python’s lists,
which are equivalent to a
ays in other programming languages. In this implementation, all operations
on lists are allowed. Other data structures should NOT be used directly in the main operations of the
list-based dictionary.
Hashtable-Based Dictionary In this subtask, you will use the native Python’s dictionary. In this
implementation, all standard Python operations on dictionaries are allowed.
Ternary-Search-Tree-Based Dictionary In this subtask, you will implement the ternary search
tree and use it to support auto-completion. Both iterative or recursive implementations are acceptable.
3.1.2 Operations Details
Operations to perform on the implementations are specified on the command file. They are in the
following format:
operation> [arguments]
where operation is one of {S, A, D, AC} and arguments is for optional arguments of some of the
operations. The operations take the following form:
• S word – searches for word in the dictionary and returns its frequency (returns 0 if not found).
• A word frequency – adds a new word and its frequency to the dictionary, returns True if
succeeded and False if the word already exists in the dictionary.
• D word – deletes word from the dictionary. If fails to delete (word is not in the dictionary),
eturns False.
• AC partial word – returns a list of three words of highest frequencies in the dictionary that
has partial word as a prefix. These words should be listed in a decreasing order of frequencies.
Maximum three words and minimum zero word will be returned.
As an example of the operations, consider the input and output from the provided testing files,
e.g., sampleDataToy.txt, testToy.in, and the expected output, testToy.exp (Table 1).
Note, you do NOT have to do the input and output reading yourself. The provided Python files will
handle the necessary input and output formats. Your task is to implement the missing methods in
the provided classes.
4
sampleDataToy.txt testToy.in (commands) testToy.exp (expected output)
cute 10 S cute Found ‘cute’ with frequency 10
ant 20 D cute Delete ‘cute’ succeeded
apple 300 A book 10000 Add ’book’ succeeded
cub 15 S book Found ‘book’ with frequency 10000
courage 1000 S apple Found ‘apple’ with frequency 300
annotation 5 D apple Delete ‘apple’ succeeded
furniture 500 D apple Delete ‘apple’ failed
find 400 AC c Autocomplete for ‘c’: [ calm: 1000 cuts: 50 cut: 30 ]
farm 5000 AC cut Autocomplete for ‘cut’: [ cuts: 50 cut: 30 ]
farming 1000 D cut Delete ‘cut’ succeeded
farmer 300 AC cut Autocomplete for ‘cut’: [ cuts: 50 ]
appendix 10 AC farms Autocomplete for ‘farms’: [ ]
apology 600
apologetic 1000
fur 10
fathom 40
apps 60
Table 1: The file sampleDataToy.txt provides the list of words and frequencies for the dictionary,
while testToy.in and testToy.exp have the list of input commands and expected output. Fo
instance, as ‘cute’ belongs to the dictionary, the expected outcome of “S cute” should be “Found
‘cute’ with frequency 10” and “D cute” should be successful. After deleting ‘cute’, “S cute”
should return “NOT Found ‘cute’”. Note that although there are more than three words having ‘c’
as a prefix, “AC c” only