Web Architecture, Java Ecosystem, Software Craftsmanship

'MongoDB for Java Developers' (M101J) II: Lessons Learned

Posted on Mar 16, 2016

In the last blog post I shared my personal experiences with the course ”MongoDB for Java  Developer” (M101J). The second part revolves around the content. I summarize my personal takeaways and add some personal assessment in terms of the content.

'MongoDB for Java Developers' (M101J) II: Lessons Learned

Transactions in MongoDB

  • The following features are not supported by MongoDB for scalability purpose:
    • Joins
    • Transactions across multiple collections
  • There are transactions and atomic operations in MongoDB. But only operations performed on a single document are atomic. So a document will never be half updated visible to other operations. But changes on multiple documents are not isolated.
  • In reality, this is often fine, because in cases where you would have multiple tables in the RDB (and where in turn you would need transactions over multiple tables), you simply embed this data into a single document in MongoDB. So changing a single document atomically is often enough to access all necessary data and ensure consistence. However, when you need to update multiple documents you can’t do this within a single transaction.
  • How to overcome lack of transactions?
    • Restructure (take advantage of atomic operation within one document)
    • Implement in the application layer (transaction scope, locking, semaphore; often necessary anyway)
    • Tolerate

CRUD

  • BSON is a superset of JSON and contains more types (Long, Date, binary etc.)
  • The “upset” flag in update operations is useful if you either want to update an existing document (if it already exists) or to create it (if not).
  • update() only updates the first document that matches. Use multi:true to update all matching documents.
  • Java Driver: The MongoClient object is heavyweight (contains a pool of connections), so don’t recreate it every time).

Schema Design

  • The schema design should be driven by the access pattern of your application. What data is used, written and accessed together?  Organize the data that way.
  • This is contrary to RDB, where the schema is application agnostic.
  • There are no joins (because they are hard to scale). Instead, you can pre-join/embed data, that is used together.
  • There are not FK-constraints, but when you embed the data, you don’t need this constraint.
  • Embedding leads to
    • a) an improved read performance (data can be read from the disk at a stretch; only search for start of data once) and
    • b) only one round trip to the DB.

The critical question is: to embed or not to embed (use separate documents)? There are some high-level aspects, which are often contrary and can’t be achieved at the same time:

  • Match the access pattern of the application
  • Avoid duplication and anomalies. Achieve consistency.
  • Performance
  • Maximal document size
  • Atomic changes

Some concrete questions:

  • Do you need to avoid duplication and anomalies in order to achieve consistency? With a 1:1 relationship you can still embed safely, because the embedded element exists only once. But when there are 1:m or m:m relationships, you may need separate collections. However, sometimes you may still embed and denormalize your data (and create duplication) in order to increase the performance or to match the access pattern of your application.
  • Is performance critical? Embed data.
  • How do you want to access the data? What is the access direction? If you access an element only via another element (and not in isolation), you should embed it.
  • Which element is growing all the time? If you only update or insert the embedded element, you have some overhead, because you always load the enclosing element into memory. In this case a separate collection is the better choice.
  • Do you need independent access to the element? If yes, use separate collection. If no (access only though enclosing element), go with embedding.
  • Should an element exist even if another element is deleted? If yes, use separate collection. If not, embed.
  • Can an element belong to multiple elements (reuse)? If yes, use separate collection. If not, embed.
  • Is it likely that you need to change the data of an element? If yes, use separate collections to have only a single point for changing the element. Otherwise, you can embed the element, but in case of changes you have to alter all elements.
  • There is the maximum of 16 MB for a single document. If your document becomes too big, you have to use separate collections.
  • Atomicity of data. If you want to change the enclosing and embedded element atomically, you should embed. Separated docs can’t be changed in isolation.

My personal thoughts about some reasons for using MongoDB instead of a RDB:

  • Natural schema design due to embedding: Many domain models fits naturally to MongoDB. There is no need to slice your model and spread it across several tables. Embedding is much more natural and intuitive. Moreover, saving joins significantly boosts the performance.
  • High development speed and rapid prototyping due to flexible data schema. There is no enforced schema. You can just put the objects into the database without setting up a fixed schema up front.

In my blog post “Why Relational Databases are not the Cure-All” I’m talking about further reasons against RDB and for NoSQL approaches like MongoDB.

Performance

  • You can increase the performance by
    • Indexes
    • Distribute the load across multiple servers using sharding
  • As of MongoDB 3.0 the WiredTiger storage engine is available which is faster (lock free; optimistic concurrency) and requires less memory (compression).

Indexes

  • “COLLSCAN” (full collection scan) are expensive and slow. Use indexes to significantly speed up your queries (at the expense of slower writes).
  • An Index is a sorted list of all names. They point to a record that contains the document. Searching in this list is very fast (e.g. binary search).
  • Rule of thumb:
    • Have an index for each query executed by your application. Every query should hit an index!
    • Don’t have unused indexes! Have at least one query hitting it. Otherwise you have a waste since the index has to be maintained.
  • Compound index: always go from left to right. Let’s assume the index (name, city, birth). The following is true:
    • Search only for name hits the index
    • Search for name + city hits the index
    • Search for name + city + birth hits the index
    • Search only for city only or for birth doesn’t hit the index (collection scan is performed)
  • Debugging:
    • explain()  shows the winningPlan
    • explain(“executionStats”)  shows the used time and how many keys and documents have been examined.
    • explain(“allPlanExecution”)  shows the executionStats for the not-winningPlan.
    • Read the information returned by explain()  from the innermost element to out.
  • Choosing a good index
    • Aim: “nReturned” should be close to “totalKeysExamined” (explain(“executionStats”) )
    • Key goal when designing an index is “Selectivity”. An index is selective if it filters out all or a lot of entries that doesn’t match the query. Hence less or none documents have to be scanned.
    • Put the most selective part on the left-side of the compound index.
    • Rules of thumb for compound indexes:
      • Equality fields before range fields
      • Equality fields before sort fields
      • Sort fields before range fields
  • When it comes to sorting, there can be situations where it’s faster to examine more keys, but to sort with an index (which is already sorted).
  • Basically a non-compound index {prop:1}  can also be utilized for descending sort {prop:-1} . However, this is not valid for the non-first parts of a compound index: sort({propA:1, propB:-1})  does not use the index {propA:1, propB:1} . Instead, you would need the index {propA:1, propB:-1}.

Step by Step Performance Optimization

  • Use explain(“executionStats”) to show the examined keys and docs and the returned documents. Goal: Only scan the number of documents that are eventually returned.
  • Find slow queries: Turn on the profiler for showing the execution time for each query.
    • By default mongod logs slow queries (slower than 100 ms).
    • You can configure the profiler at startup (>mongod –profile=1 –slowms=0) or in mongo shell (db.setProfilingLevel(1, 0))
    • You can execute queries against the profile log: db.system.profile.find( {millis:{$gt:1000}} ).sort( {ts:-1} )
    • Find the longest running query: db.profile.find().sort( {millis:-1} ).limit(1)
  • Analyze the queries: Use explain(“executionStats”)  to investigate the execution time, keysExamined, docsExamined and whether an index was hit or not. (“IXSCAN” vs “COLLSCAN”).

Other useful tools:

  • Mongostat is a performance tuning tool and shows what’s going on in the database during one second (inserts, queries, updates, deletes, IO, memory usage, network etc.).
  • Mongotop provides a high-level overview of where MongoDB is spending its time (e.g. read and write time spend per collection).

Aggregation Framework

  • The aggregation framework can be used for more sophisticated queries or if you want to transform your data. For instance, you can count or group your data.
  • You build an aggregation pipeline consisting of operations. One operation can occur multiple times in a single pipeline. The operations in the pipeline are executed on the input one after another. It works pretty much like the Java 8 Stream API.
  • The most important operations are: $project, $match, $group, $sort, $skip, $limit, $unwind, $out

My personal thoughts about aggregation:

  • I personally don’t like the syntax to build aggregation pipelines. Dealing with the curly braces hell is a cumbersome pain. Using the Document class of the Java Driver is even worse. But the Aggregation Support of Spring Data MongoDB looks promising.
  • However, the aggregation framework is very useful. I can get nearly everything out of my documents without having an optimal schema for this query. But in general your schema design should be application driven and the application should fulfill its daily queries without the aggregation framework. But using aggregation for ad-hoc queries or BI is absolutely fine.

Application Engineering

  • Write Concern (“w”). Possible values:
    • 0: unacknowledged wait; send write but don’t wait for ACK
    • 1 (default): wait for ACK
    • n: wait for ACK of n nodes in replica set
    • ‘majority’: wait for ACK of the majority of nodes.
  • Journal (j)
    • false (default): Data is only written to memory. Don’t wait for journals to be written to disk.
    • true: ensures that journals are written to disk.

Replication

  • Aims: Availability and Fault Tolerance. Not Performance (at least at first).
  • Updates are written to the primary and asynchronously replicated to the secondaries.

Write Consistency

  • Writes always go to the primary.
  • In the default configuration all reads go to primary
    • Results in strong consistency (you read what you write; other apps see your writes)
  • You can allow reads to go also to secondaries (“read preference”)
    • Benefit: read scaling
    • Drawback: read data can be outdated (due to the lag for data synchronization)
    • Results in eventual consistency (sometimes not ok, e.g. when one request writes a session and this session is read out in subsequent requests. But you can configure the read preference at collection-level)
  • MongoDB provides strong consistency in the default configuration.

Some personal thoughts: When talking about “Consistency” and MongoDB we need to differentiate.

  • Consistency in terms of transactions (which means actually Atomicity): All operations of a transaction are either executed completely or not at all. The effect is only visible in case of success. MongoDB does not fulfill this completely. Operations within one document are atomic. But you can’t update two different documents atomically (see above).
  • Consistency in terms of data across several nodes: I can read what I had written before. MongoDB is strictly consistent (in the default configuration), because all reads and writes go to the primary. If the primary ACKs a write, you can be sure, that subsequent reads will return the new data. However, you can configure the read preference to allow secondaries to reading. In this case you may read outdated data (not yet replicated). In this case you have eventual consistency.

Java Driver

  • new MongoClient(asList(serverAddress1, serverAddress2))
  • If you pass a List of Mongo addresses the driver switches to the replica set mode. In this case it automatically discovers the primary in the replica set. However, if you only pass one address and this one is not available on start-up the driver doesn’t start.
  • Best Practices:
    • Always define several members of the replica set. So it’s not so bad if one is not available on start-up. The driver still finds the replica set and the primary.
    • Be aware of the MongoExceptions which will be thrown in case of a failure (e.g. failover, network errors, unique key constraints). Catch the exception if necessary (e.g. when you want to continue with another element instead of aborting completely)

Why shouldn’t I read from secondaries?

  • The secondaries can get overwhelmed if there is great write traffic and a weak secondary (secondaries need to process all writes AND reads)
  • You don’t read what you previously wrote (the secondary lags behind). Consider if this is acceptable for your use case.

Implications of Replication on Development

  • Use seed lists of ServerAddress objects
  • Write Concerns
  • Read Preferences
  • Handle MongoException

Sharding

  • Scaling out
  • Split data between shards (shards are in turn often a replica set)
  • Queries are distributed by a router called “mongos”
  • You can configure a range-based or a hash-based distribution (more even distribution of data, but worse performance for range-based queries)

Implications of Sharding on Development

  • Every document has to include the shard key
  • The shard key is immutable
  • Need an index on shard key or non-multi-key index that starts with the shard key
  • For updates you have specify the shard key (or multi has to be true). If there is no shard key and multi is true, the update is sent to all nodes.
  • If no shard key if provided in the query, all nodes are queried and the results are gathered together (“scatter gather”). That’s expensive.
  • You can’t have a unique key (unless part of the shard keys). There is no way to enforce uniqueness, because the index is on each shard. Hence, there is no collective way for Mongo to know if the uniqueness is satisfied.

Choosing a Shard Key

  • Sufficient cardinality (variety of values)
    • You can add a second part in the key with more cardinality
  • Avoid hotspotting in writes. This happens when the key is monotonically increasing.
    • e.g. ID or date is monotonically increasing, so inserts are hammering the same shard. If you write frequently, this will hit the performance.

Misc

  • In many cases you often don’t need an intermediate cache layer. This is contrary to RDBs where you want to cache aggregated object that were constructed by joining many tables. With MongoDB you can address performance directly at the database layer (by embedding, index, replication, sharding). Hence, you can attach your application directly to the database without a caching layer.
  • Moving to the cloud
    • Benefit: easy provisioning and horizontally scaling
    • Drawback: latency overhead between application and database box. It’s often a latency and not a bandwidth problem, so you can try to batch database queries in one request or use asynchronous approaches.