Review of my experience with Google Datastore

in google •  7 years ago  (edited)

Google Cloud Datastore is a schemaless NoSQL data repository that is massively scalable and managed completely by Google. It is emerging as a workhorse database, provided your primary use cases don't include OLTP or analytic workloads. Amazon of course has DynamoDB, but I will not get into a comparison of the two databases here.

I found Cloud Datastore to be fun to use, and it fit nicely with the workload I was implementing. My workload was primarily an intelligent document management system, and Datastore was perfect for it. I am sure you can find plenty of documentation on the internet, Stack Overflow and Google itself regarding Cloud Datastore's capabilities, but I am more interested in talking about my experiences with it.

Right off the bat I have to say that its scalability is amazing: its read/write latencies remained steady throughout our testing period. It does, however, have some unique quirks and restrictions that I had to get used to while working with the beast! Google does a great job of laying down the ground rules in their documentation, but you will only ever know the reality when your app is running (or staggering!).

Every single attribute of an entity is indexed

Every attribute you create inside a Datastore entity is automatically indexed. This means a write to persist the entity also persists its indexes. The indexes are part of the entity being written, but internally Datastore breaks the request up into an entity write plus index writes. Naturally, more indexes mean slower entity writes. Google refers to attributes as properties, just in case you were wondering!

I liked the fact that Datastore gives you a manual way to exclude attributes from being indexed. This feature lets you divide attributes into those that should be indexed and those that should not. I found it advisable not to leave these decisions to individual programmers!! In fact, you will find yourself watching the entity writes on a daily basis and hating the upward spikes that occur when attribute-heavy entities are written.

Eventually we got around this by writing our classes so that all attributes were excluded from indexing by default, and we programmatically set the index characteristic based on the model object.
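The idea can be sketched in a few lines. This is a minimal, hypothetical version of such a base class (the `Model`/`Document` names are mine, not from our codebase); the resulting tuple is what you would pass as `exclude_from_indexes` when constructing an entity with the google-cloud-datastore client.

```python
class Model:
    """Sketch of a persistence base class: every attribute is excluded
    from Datastore indexes unless the model explicitly opts it in."""
    indexed = ()  # subclasses list the few attributes worth indexing

    def __init__(self, **attrs):
        self.attrs = attrs

    def exclude_from_indexes(self):
        # Everything not explicitly declared indexed gets excluded,
        # so a new attribute never silently adds index writes.
        return tuple(a for a in self.attrs if a not in self.indexed)

class Document(Model):
    indexed = ("title", "owner")

doc = Document(title="Q3 report", owner="asha", body="...", raw_json="{}")
print(doc.exclude_from_indexes())  # ('body', 'raw_json')
```

The key design choice is that forgetting to declare an attribute costs you an index, not a surprise write spike.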

Multiple values within the same attribute can be indexed differently

This really hurt us, and I have to admit this was a case of not reading the documentation closely. The problem is simple: an attribute X that can hold values of multiple data types is indexed within the entity. If your code is heavily front-loaded onto mobile apps consuming JSON, you will run into this problem. There is a natural tendency to keep most things as strings so that representation and handling within the UI stay consistent and less confusing. The attribute in our case was an instanceID, which could be an integer or a string.

When we ran search queries to get the latest document, we would often fail to get the latest entry. Eventually we figured out that the integer values were being sorted ahead of the string values, and hence chaos ensued. Debugging this was a nightmare, and no amount of Stackdriver logging helped us!
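You can reproduce the surprise without touching Datastore at all. Datastore orders values first by type and then by value, and integers sort before strings; the snippet below mimics just that slice of the ordering (the `TYPE_RANK` table is a simplification, not the full documented type order).

```python
# Simplified slice of Datastore's cross-type ordering:
# every integer value sorts before every string value.
TYPE_RANK = {int: 0, str: 1}

def datastore_sort_key(value):
    return (TYPE_RANK[type(value)], value)

instance_ids = [3, "10", 7, "2"]
print(sorted(instance_ids, key=datastore_sort_key))
# [3, 7, '10', '2'] -- ints first, then strings compared lexically
```

Note the second trap hiding in the output: even among strings, "10" sorts before "2", so a string-typed numeric ID breaks ordering twice over.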

Embedded entities can cause search performance headaches

The coolest thing about Datastore was that we could create embedded objects within entities. This let us model related objects together and avoid unnecessary database queries to retrieve related data, so we wrote an attribute mapper that automatically embedded nested hashes as sub-entities. Searching for an entity then involved searching attributes within the sub-entities as well. Since each sub-entity in Datastore is an independent entity with a separate entity key, this essentially doubles the search requests. You have to be really careful if your sub-entities contain arrays, or if you have an array of sub-entities: in our particular model this multiplied access times by 10x. Reads and writes are compounded into multiple objects per request, and those were fine; it was search that suffered.

We ended up not indexing sub entities at all!
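Here is a toy version of what that mapper ended up doing; the function and field names are hypothetical, but the shape is the same: nested hashes become embedded sub-entities, and every nested attribute lands on the exclude-from-indexes list so sub-entities never generate index writes.

```python
def to_entity(data, exclude_nested=True):
    """Sketch of our attribute mapper: nested dicts become embedded
    sub-entities; with exclude_nested=True they stay out of the indexes."""
    excluded = []
    entity = {}
    for name, value in data.items():
        if isinstance(value, dict):
            entity[name] = to_entity(value)[0]  # embed recursively
            if exclude_nested:
                excluded.append(name)
        else:
            entity[name] = value
    return entity, tuple(excluded)

doc = {"title": "spec", "meta": {"pages": 4, "author": "li"}}
entity, excluded = to_entity(doc)
print(excluded)  # ('meta',)
```

Anything we genuinely needed to search on was promoted to a top-level attribute instead of being queried through the sub-entity.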

Search queries have to be thought of before entity design

This is a common requirement across NoSQL databases, but because Datastore encourages dynamic attribute creation, you are definitely better off modeling all search queries first. Searches and sorts will not work if the attributes are not indexed, which was fairly obvious from the documentation. A query against an unindexed attribute fails silently, which puts considerable pressure on getting the search queries right before you rush off into implementation.

We ended up having each object declare its own search method, which kept a list of searchable attributes for the entity, and we implemented a query builder to ensure that queries could not be built unless the right indexes were in place. Though this was slower, it saved us a lot of grief when the database grew.
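A stripped-down sketch of that guard, with hypothetical names, looks like this: the builder refuses to construct a filter or sort on anything outside the declared indexed set, turning Datastore's silent failure into a loud local one.

```python
class QueryBuilder:
    """Sketch of the guard we put in front of Datastore queries:
    filters and sorts are rejected unless the attribute is indexed."""

    def __init__(self, indexed):
        self.indexed = frozenset(indexed)
        self.filters, self.orders = [], []

    def filter(self, attr, op, value):
        if attr not in self.indexed:
            raise ValueError(f"{attr} is not indexed; the query would fail silently")
        self.filters.append((attr, op, value))
        return self

    def order_by(self, attr):
        if attr not in self.indexed:
            raise ValueError(f"{attr} is not indexed; the sort would fail silently")
        self.orders.append(attr)
        return self

qb = QueryBuilder(indexed=("title", "created_at"))
qb.filter("title", "=", "spec").order_by("created_at")  # fine
# qb.filter("body", "=", "x") would raise ValueError
```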

Manual namespace selections are a nightmare

Google allows two different ways to set the namespace for entities. One is to set the namespace at the time of connecting to Datastore; the other is to set it prior to query execution. The former is isolated in its approach; the latter is pure hell waiting to break loose. If you get the namespace wrong, or the namespace does not exist, all queries fail silently. And if you create an incorrectly namespaced entity key before storing an object, there is no mechanism within Datastore to warn you that you are writing data that can never be read back.

We initially tried to keep a separate key cache that would associate the namespace correctly. We dropped that when we realized it almost halved performance. We have since created a separate app instance per namespace, with the namespace set at instantiation. This is quite expensive, but it is easier than the programming complexity of the per-request approach.
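The per-instance approach reduces to something like the sketch below (the registry and class names are hypothetical): the namespace is validated exactly once, at construction, and every key the instance builds inherits it. With the google-cloud-datastore client the real equivalent is passing the namespace when constructing the client rather than per query.

```python
KNOWN_NAMESPACES = frozenset({"tenant-a", "tenant-b"})  # hypothetical registry

class ScopedClient:
    """Sketch of our per-instance approach: the namespace is fixed and
    validated once at construction, never chosen per request."""

    def __init__(self, namespace):
        if namespace not in KNOWN_NAMESPACES:
            raise ValueError(f"unknown namespace {namespace!r}: writes would be unreadable")
        self.namespace = namespace

    def key(self, kind, name):
        # Every key inherits the validated namespace.
        return (self.namespace, kind, name)

client = ScopedClient("tenant-a")
print(client.key("Document", "doc-1"))  # ('tenant-a', 'Document', 'doc-1')
```

A typo in a namespace now fails at startup instead of producing entities no query will ever find.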

Be careful with the emulator

The Datastore emulator is great, and it helps enormously to test against it. Any complex modeling should be attempted on the emulator first. But keep in mind that there are subtle differences between the emulator and the actual cloud APIs. For example, one of the entity keys we were writing always worked against the emulator but failed sporadically when we bypassed it. It turned out that the key's data type is almost always checked by the RPC marshalling code in the real service but not in the emulator. We also had some issues with multiple indexes for an entity: the emulator would always throw an exception if an index was missing, but the cloud version never did. I believe this has been fixed now, but I have not verified it.

We overloaded our persistence classes with entity key/property type checkers so that we would not be hit with unpleasant errors later on.
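In spirit, those checkers were nothing more than this (the schema and function names are illustrative): declare the expected type per property and fail fast locally, instead of waiting for the real service's RPC marshalling to reject the write in production.

```python
SCHEMA = {"instance_id": int, "title": str}  # hypothetical expected types

def check_properties(props, schema=SCHEMA):
    """Sketch of the type guard on our persistence classes: fail fast
    locally instead of relying on server-side marshalling checks."""
    for name, expected in schema.items():
        if name in props and not isinstance(props[name], expected):
            raise TypeError(
                f"{name} should be {expected.__name__}, "
                f"got {type(props[name]).__name__}"
            )

check_properties({"instance_id": 42, "title": "ok"})  # passes
# check_properties({"instance_id": "42"}) would raise TypeError
```

This also papered over the emulator/cloud mismatch: both environments saw the same local check, so behavior stopped diverging between them.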

Closing thoughts on the Datastore

It is a terrific service, and one that enterprises should utilize for running non-transactional apps. The Google console is genuinely useful for monitoring and assessing the predictability and performance of your architecture. There were some timeout issues between Datastore and App Engine that went away when we moved from fixed to flexible instances. If you want a drop-in solution for your enterprise app, just write yourself a shim layer to interface with Datastore and hit the ground running. Overall I have had a very positive experience.
