Doing the impossible: decentralised fuzzy search on a DHT

@heimdanger recently introduced us to DTube here:
https://steemit.com/video/@heimindanger/introducing-dtube-a-decentralized-video-platform-using-steem-and-ipfs

In his post he states this:

I have literally no idea if making a fuzzy search algorithm on a DHT network is possible without creating a point of failure, and I don't think anyone has an answer to this question on Earth

A DHT implements a form of key/value lookup by using consistent hashing to partition the keyspace across a network of peers.

When searching for a particular key, it's hashed and the current set of connected peers is compared to the resulting consistent hash to find which peers are responsible for that value. When on an untrusted P2P network, there is often some form of majority consensus required or cryptographic verification of the results, but that's the basics.

Fuzzy search requires us to lookup the keys without perfect matches on those keys and is often used to account for spelling errors. If I do a fuzzy search for steamit, it's likely I meant "steemit" and the search engine will correct me.

To do this, it must look at the number of operations required to transform one string into another: in this case only 1 byte need change, and so it's high up on the list of matches.

Doing this with a DHT is problematic because one small change in the input of a hash algorithm, by design, causes large changes in the output - making the problem appear impossible.

This is an example of not thinking concurrently. The strength of a DHT is that it is distributed and concurrent by its very nature. Rather than thinking in terms of a classic fuzzy search, we need to look at how we would do it in a parallel cluster - because that is what we have.

So let's approach the problem this way: we need to find which words have the lowest hamming distance from our search words.

To simplify it further, let's make it a single word we're searching for. We want to find "steemit" from "steamit", how might we do this in a cluster with lots of keys?

To find the hamming distance between 2 words is trivial to do in parallel with brute force, and so if we were to search EVERY key in a database and compare one by one, we'd get our result - but this is horrendously inefficient.

Fixing the problem, we can use sharding: we transform "steemit" into "stemi" and sort alphabetically to arrive at eimst.

We next transform "steamit" the same way to get "aeimst".

We lookup the keys for both on the DHT, yielding a list of words observed in actual steem content for each, comparing locally to find which matches our actual search query best. This yields the nearest result that actually exists somewhere.

The nodes in the DHT that store particular words will always be mapped, with redundancy, by the consistent hashing algorithm used by the DHT software.

Finally, we simply lookup the word "steemit" across the DHT and get a bunch of results for content that contains the word. We can also at the same time lookup the word "steamit" and other lower matching terms if desired.

How do we get the word mappings into the network in the first place you may be asking - that part is simple and can be decentralised too. When we post a piece of content we want to be searchable, we extract a list of words from it, and then we send those words out to the network with the transformed mappings used above for sharding.

Whenever a search query is made, the results of that query can be reinserted into the network, updating the shards that the searching peer is connected to and ensuring distribution of the shards (since each peer itself has a limited view of the network). Over time, the entire network will tend towards knowing which words are present in steem content.

Upon searching for an actual word, the DHT (in another keyspace) can return content hashes that contain the word, and counts of that word in that content, each node only returning part of the overall total of the available content and storing only mappings of words to content hashes and word counts.

With multi-word search queries, we simply transform each word as described for the sharding, yielding shards for every word.

Then it's a simple matter of building a distributed bayesian classifier to find content that most closely matches the search terms.

Sort Order:

Trending

[-]

deshwal (45) · 8 years ago

nice content upvoted

$0.00

heimindanger (68) · 8 years ago

Did you do any testing of this method? In particular, what if someone spams some searches on some popular 'shards'? Don't you think it would introduce a point of failure?

Why wouldn't anyone have done it by now, for something like BitTorrent for example then?

Can you code something like that?

garethnelsonuk (60) · 8 years ago

Each peer in the network connects to a random set of other peers and thus has a unique view of the network. When any particular node runs a search with its currently connected peers, that is only a small subset.

In other words, each shard is replicated and mirrored on multiple nodes, because each lookup of a word returns multiple peers that have data about that word.

So this solution would waste a bunch of storage space.

Doesn't exact search on DHT usually work with bounces? Nodes relay the search (to less and less nodes depending on the number of already made bounces), and if a node with a positive result isn't connected to the node that originally sent the search, then connect to it. I saw a few DHTs using this technique succesfully. Clearly doing the same for fuzzy searching isn't possible as there is not positive (1) or negative (0) results, but also anything in the middle.

Seriously man, I'm very interested if your solution could work, and I'd rather see schematics or code rather than text. It would be much clearer to me.

I have some personal open source code of a DHT using WebRTC as a transport where I implemented exact search this way. I tested it and it's fine. I never really thought of any other way.

https://github.com/skzap/waka2/blob/master/peer.js

Can you code my friend?

1 vote

I'll work on some actual code later today or tomorrow to demonstrate how it could work.