This follows on from the previous post about fuzzy search on a DHT.
As a quick recap from last time: a DHT is basically a way to partition a large keyspace across a cluster of nodes. We find out which nodes are responsible for a particular key using consistent hashing and then we communicate with those peers.
A Bayesian classifier is a fun little machine learning algorithm commonly used in spam filters and other areas, but in theory it can be used to implement search too.
First, let's look at how we would build a Bayesian classifier with a local hashtable as the backing store. We want to be able to train our classifier on "spam" and "not spam" examples and get back probability scores for particular emails. This is a very common task when constructing spam filters and thus serves as an ideal example.
The data structure consists of a hashtable (or other key/value store) mapping each word or token to a count of how often that token occurred in each category.
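Concretely, the table might look something like this (a minimal Python sketch; the names are mine, not from any particular library):

```python
from collections import defaultdict

# token -> {category: count}; a plain in-memory stand-in for the hashtable
counts = defaultdict(lambda: defaultdict(int))

counts["cheap"]["spam"] += 1        # saw "cheap" in a spam example
counts["meeting"]["not_spam"] += 1  # saw "meeting" in a non-spam example
```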
Step one is training: we cut each email example up into words or tokens and normalise them as appropriate. Then, for each word in the spam example, if that word already exists in our table we increment its spam count; if it doesn't, we add it with a spam count of 1.
We repeat this for the not-spam example, and keep going with more examples until we end up with a table of words showing, for each word, the relative counts within each category. While adding words, it often makes sense to normalise the counts as well, and a simple algorithm can be used for this (there's a small sketch after the list):
- Add up the sum of counts in all categories (only 2 for our simple spam filter, but it can handle more in principle)
- Divide each count by the sum, multiply by 100 to get a % and then store that %
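Putting the training loop and that rebalancing together, here's a minimal sketch. The tokenisation is deliberately naive and all of the names are illustrative rather than anything prescribed above:

```python
import re
from collections import defaultdict

counts = defaultdict(lambda: defaultdict(int))  # token -> {category: count}

def tokenize(text):
    """Naive tokenisation: lowercase and split on non-word characters."""
    return [t for t in re.split(r"\W+", text.lower()) if t]

def train(text, category):
    """Increment the count of every token in `text` under `category`."""
    for token in tokenize(text):
        counts[token][category] += 1

def percentages(token):
    """The rebalancing from the list above: each count divided by the sum, as a %."""
    cats = counts.get(token, {})
    total = sum(cats.values())
    return {cat: 100.0 * n / total for cat, n in cats.items()} if total else {}

train("cheap pills, buy now", "spam")
train("minutes from the team meeting", "not_spam")
print(percentages("cheap"))  # {'spam': 100.0}
```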
This same rebalancing can then be used to perform classifications. We take the email to be classified, tokenize it, and for each token look up the % stored in the table for spam and for not spam. We then add up the % values across all tokens for spam, do the same for not spam, and rebalance those two sums into % values to yield our final probability for each category.
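A sketch of that classification step, again with illustrative names, taking a pre-built counts table as input:

```python
import re
from collections import defaultdict

def tokenize(text):
    return [t for t in re.split(r"\W+", text.lower()) if t]

def classify(text, counts):
    """Score `text` against every category in the counts table.

    For each token, rebalance its counts into per-category percentages,
    sum those percentages per category, then rebalance the sums into a
    final percentage per category.
    """
    totals = defaultdict(float)
    for token in tokenize(text):
        cats = counts.get(token, {})
        token_total = sum(cats.values())
        if token_total == 0:
            continue  # unseen token, ignore it
        for cat, n in cats.items():
            totals[cat] += 100.0 * n / token_total
    grand_total = sum(totals.values())
    return {cat: 100.0 * v / grand_total for cat, v in totals.items()} if grand_total else {}

# Tiny demo with a hand-built counts table.
counts = {"cheap": {"spam": 3, "not_spam": 1}, "meeting": {"spam": 0, "not_spam": 5}}
print(classify("cheap meeting invite", counts))
```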
Turning this simple naive Bayesian classifier into a hackish text search is easy: we just add content hashes as categories and "classify" search queries, treating the best-scoring categories as the best-matching documents.
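As a rough sketch of that trick (the choice of SHA-1 for the content hash and all of the helper names are my own assumptions):

```python
import hashlib
import re
from collections import defaultdict

counts = defaultdict(lambda: defaultdict(int))  # token -> {content_hash: count}

def tokenize(text):
    return [t for t in re.split(r"\W+", text.lower()) if t]

def index(document):
    """'Train' on a document, using its content hash as the category."""
    doc_hash = hashlib.sha1(document.encode()).hexdigest()
    for token in tokenize(document):
        counts[token][doc_hash] += 1
    return doc_hash

def search(query):
    """'Classify' the query: the best-scoring categories are the best matches."""
    scores = defaultdict(float)
    for token in tokenize(query):
        cats = counts.get(token, {})
        total = sum(cats.values())
        for doc_hash, n in cats.items():
            scores[doc_hash] += 100.0 * n / total
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

index("distributed hash tables partition a keyspace")
index("bayesian filters catch spam")
print(search("spam filters"))
```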
Think concurrent
So how do we go from the above to a distributed classifier that won't have really bizarre results when untrusted nodes are present?
First, the obvious part: we split up our keyspace as normal, with any particular token mapping to multiple nodes. This yields a small set of nodes which will each return their own % values for that token in each category. We rebalance those % values (sum the values, divide each value by the sum, multiply by 100) and end up with what the network says each token would be classified under on its own.
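A sketch of that per-token step, with hard-coded node responses standing in for the actual DHT lookups (everything here is illustrative):

```python
from collections import defaultdict

def rebalance(values):
    """Sum the values, divide each by the sum, multiply by 100."""
    total = sum(values.values())
    return {k: 100.0 * v / total for k, v in values.items()} if total else {}

def network_view(node_responses):
    """Combine the % values several nodes report for one token.

    `node_responses` is a list of {category: percent} dicts, one per node
    responsible for the token. Summing and rebalancing gives what the
    network as a whole says the token would be classified under.
    """
    combined = defaultdict(float)
    for response in node_responses:
        for cat, pct in response.items():
            combined[cat] += pct
    return rebalance(combined)

# Two (imaginary) nodes responsible for the token "cheap":
responses = [{"spam": 80.0, "not_spam": 20.0},
             {"spam": 60.0, "not_spam": 40.0}]
print(network_view(responses))  # {'spam': 70.0, 'not_spam': 30.0}
```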
Next, we take just the per-token classifications and do the same as we did above with the spam filter to get a final % probability for each category. The problem, of course, is that one malicious node could artificially inflate things drastically. To counter that we simply take the mean % value reported for each token, which assumes a majority of nodes are trustworthy. If we disagree with how accurate a classification is, we can fix that by altering the weights we give to each of the nodes we're talking to, so over time the local node trains both itself (about the trustworthiness of its peers) AND the rest of the network (in the actual DHT).
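And a sketch of that trust-weighted averaging, with a hand-maintained weight table standing in for whatever reputation scheme you end up using (again, all names are assumptions of mine):

```python
from collections import defaultdict

def weighted_mean_percentages(reports, weights):
    """Average the % values reported by several nodes for one token.

    `reports` maps node_id -> {category: percent}; `weights` maps node_id
    to a trust weight (1.0 = fully trusted). A single node inflating its
    numbers is damped as long as most nodes are honest and its weight
    drops over time.
    """
    sums = defaultdict(float)
    weight_total = 0.0
    for node_id, percentages in reports.items():
        w = weights.get(node_id, 1.0)
        weight_total += w
        for cat, pct in percentages.items():
            sums[cat] += w * pct
    return {cat: s / weight_total for cat, s in sums.items()} if weight_total else {}

reports = {
    "node_a": {"spam": 80.0, "not_spam": 20.0},
    "node_b": {"spam": 75.0, "not_spam": 25.0},
    "node_c": {"spam": 5.0,  "not_spam": 95.0},   # looks like an outlier
}
weights = {"node_a": 1.0, "node_b": 1.0, "node_c": 0.2}  # downweighted over time
print(weighted_mean_percentages(reports, weights))
```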
The end result is a distributed Bayesian classifier that will, on average, yield accurate results over time unless the network is extremely untrustworthy. Build search on top of this and you get a complete distributed search engine.