(Over the past six months, I've conducted over twenty in-house interviews at Amazon. What follows is the answer to one of the questions I've commonly asked, which is now retired from my list. This makes it imminently boring to non-technical types.)
Lead-in:Can you tell me the best way to store a dictionary?
The candidate usually identifies why we want to store the dictionary correctly - to look up words later - and invariably they come to the same conclusion as previous candidates before them: a hashtable.
At this point, I we talk about the benefits of hashtables - the average lookup time is great. The worst lookup time is horrible, as are the space requirements, which makes a good solution only for certain cases. More conversation is had about different architectures (server, desktop, small machine, mobile, etc) and their differing requirements.
Then I drop a twist on them: the dictionary data structure must fit in less memory than the space the dictionary file uses on disk. After a few minutes of staring at me, the candidate's jaw usually begins to slacken.
At this point, we talk about tries, a data structure that is designed to store information about sets of things (sets of letter in the case of words), especially their existence or non-existence. Essentially, it's a tree of nodes, with each node having a map that connects a letter with a child node. In any given node, you can mark it as an 'end of word' (or 'end of word' in the dictionary case) - any search that ends on a marked node should return 'true' for the existence of that word.
A trie has properties allow you to take advantage of the fact that most words start with a common prefix (think 'presume', 'pressure', 'present', 'presidential' - they all start with 'pres'.) All of the words with common prefixes get compressed down into one path on the tree, which means you can fit a lot of words in relatively little space.
Follow-up: OK, so we know what a trie is - can you code one on the whiteboard?
I've had varying degrees of success with this question - most people can't code on a whiteboard and it shows when they try. That's a topic for another time- right now, let's concentrate on implementing the trie in Java.
Start with the basics - a Trie must be able to do two things - insert a new word and look up the existence of a word. Let's prototype those methods:
public class Trie {
public void insert(String data) {
}
public boolean contains(String data) {
return false;
}
}
So far so, good. Some candidates choose to make Trie and Node a separate class - that's a matter of style the I don't distinguish as different. I choose to implement it in one class because it is easier overall.
Let's take part of the description and make it real: "a trie has a map of characters to child nodes" and "has an 'end of word' marker"
public class Trie {
private Mappaths=new HashMap ();
private boolean isWord=false;
public void insert(String data) {
}
public boolean contains(String data) {
return false;
}
}
Here, I usually dock the candidate if they don't use byte, don't use a boolean for the marker, and especially if they don't know how to build a map in Java. Generics are optional, but the bonus on this problem requires it and most candidates forget the cast from Object to Trie when they look up nodes later.
Next, we have to implement the insert method. This is essentially picking off the first byte, looking up to see if there is a path present, creting one if not, and repeating until we hit the end of the string. Once we're at the end, we set the marker to true.
public class Trie {
private Mappaths=new HashMap ();
private boolean isWord=false;
public void insert(String data) {
if (data==null) {
return;
}
byte [] bytes=data.getBytes();
Trie current=this;
for (byte b: bytes) {
if (!current.paths.containsKey(b)) {
Trie next=new Trie();
current.paths.put(b, next);
current=next;
} else {
current=current.paths.get(b);
}
}
current.isWord=true;
}
public boolean contains(String data) {
return false;
}
}
This does what it says - iterate over the bytes and when you get to the end, mark it. Penalties are given for not checking for null and no points are lost if recustion is used instead of iteration.
Next, we do lookup:
public class Trie {
private Mappaths=new HashMap ();
private boolean isWord=false;
public void insert(String data) {
if (data==null) {
return;
}
byte [] bytes=data.getBytes();
Trie current=this;
for (byte b: bytes) {
if (!current.paths.containsKey(b)) {
Trie next=new Trie();
current.paths.put(b, next);
current=next;
} else {
current=current.paths.get(b);
}
}
current.isWord=true;
}
public boolean contains(String data) {
if (data==null) {
return false;
}
byte [] bytes=data.getBytes();
Trie current=this;
for (byte b: bytes) {
if (!current.paths.containsKey(b)) {
return false;
} else {
current=current.paths.get(b);
}
}
return current.isWord;
}
}
That's it. One working Trie, usable on strings.
Bonus:: Ok, so let's make it work on any arbitrary set of data, such as an array of Integers, but not limited to that. I want a generic Trie.
This is an exercise left to the reader - needless to say, it's not hard and getting it perfect will give you a Trie that will work for every future use. Post your solution in the comments below!
