Improving Harvesting Strategy for a Distributed Social Media Search Engine

in search-engine •  7 years ago  (edited)

About Kaizen Harvester

Kaizen is an alternative approach to do harvesting in loklak. It focuses on query and information collecting to generate more queries from collected timelines. It maintains a queue of query that is populated by extracting following information from timelines –

  1. Hashtags in Tweets
  2. User mentions in Tweets
  3. Tweets from areas near to each Tweet in the timeline
  4. Tweets older than oldest Tweet in the timeline

Further, it can also utilise Twitter API to get trending keywords from Twitter and get search suggestions from other loklak peers.

It was introduced by @yukiisbored in pull request loklak/loklak_server#960.

The Problem: Unbiased Harvesting Decision

The Kaizen harvester either searches for queries from the queue or tries to grab trending queries (using Twitter API or from backend). In the previous version of KaizenHarvester, the decision of “harvesting vs. info-grabbing” was taken based on the value from a random boolean generator –

@Override
public int harvest() {
   if (!queries.isEmpty() && random.nextBoolean())
       return harvestMessages();
   grabSuggestions();
   return 0;
}

[SOURCE]

Under sane situations, the Kaizen harvester is configured to use a fixed size queue and drops the queries which are requested to get added once the queue is full. And since the decision doesn’t take into account the amount to which queue is filled, it would often call the grabSuggestions() method.

But since the queue would be full, the grabbed suggestions would simply be lost. This would result in wastage of time and resources in fetching the suggestions (from backend or API). To overcome this, something better was to be done in this part.

The Solution: Making Decision Biased

To solve the problem of dumb harvesting decision, the harvester was triggered based on the following steps –

  1. Calculate the ratio of queue filled (q.size() / q.maxSize()).
  2. Generate a random floating point number between 0 and 1.
  3. If the number is less than the fraction, harvest. Otherwise get harvesting suggestions.

Why would this work?

Initially, when the queue is mostly empty, the ratio would be a small number. So, it would be highly probable that a random number generated between 0 and 1 would be greater than the ratio. And Kaizen would go for grabbing search suggestions.

If this ratio is large (i.e. the queue is almost full), it would be highly likely that the random number generated would be less than it, making it more likely to search for results instead of grabbing suggestions

Graph?

The following graph shows how the harvester decision would change. It performs 10k iterations for a given queue ratio and plots the number of times harvesting decision was taken.

Change in Code

The harvest() method was changed in loklak/loklak_server#1158 to take smart decision of harvesting vs. info-grabbing in the following manner –

@Override
public int harvest() {
   float targetProb = random.nextFloat();
   float prob = 0.5F;
   if (QUERIES_LIMIT > 0) {
       prob = queries.size() / (float)QUERIES_LIMIT;
   }
   if (!queries.isEmpty() && targetProb < prob) {
       return harvestMessages();
   }

   grabSuggestions();

   return 0;
}

[SOURCE]

Conclusion

This change brought enhancement in the Kaizen harvester and made it more sensible to how fast its queue if filling. There are no more requests made to the backend for suggestions whose queries are not added to the queue.

Resources

Originally posted at FOSSASIA blog - Improving Harvesting Decision for Kaizen Harvester in loklak server

Authors get paid when people like you upvote their post.
If you enjoyed what you read here, create your account today and start earning FREE STEEM!
Sort Order:  

The picture is funny!

Thanks, Rajat!

Hi! I am a robot. I just upvoted you! I found similar content that readers might be interested in:
https://blog.fossasia.org/improving-harvesting-decision-for-kaizen-harvester-in-loklak-server/