Java Ecosystem, Kotlin, Distributed Systems, Sociology of Software Development

Web API Pagination with the 'Timestamp_ID' Continuation Token

Posted on Jan 31, 2018

In the last post, I presented the Timestamp_Offset_Checkum continuation token for implementing web API pagination. After using them in practice for a while we discovered some drawbacks and finally came up with a new approach: Timestamp_ID. This approach is fast, scalable, efficient, stateless and easy to implement. Our pursuit of the best pagination strategy has finally come to an end.

Web API Pagination with the 'Timestamp_ID' Continuation Token

TL;DR

  • There are many different pagination approaches with different characteristics.
  • The Timestamp_Offset_Checksum token is getting slower and slower the more elements with the same timestamp exist.
  • After contemplating and implementing many pagination approaches in practice, we finally ended up with the Timestamp_ID token.
  • Basically, it’s the query WHERE (timestampColumn > T OR (timestampColumn = T AND idColumn > I)) ORDER BY timestampColumn asc, idColumn asc;
  • The advantages:
    • Fast: It operates only on indexed columns. No slow count or OFFSET queries.
    • Reliable: No element is missed. No endless loops.
    • Scalable: Constant response times even in case of a huge amount of elements with the same timestamp.
    • Efficient: Basically, each element is only delivered once to the client. Elements are only delivered multiple times if their timestamp is updated during a single pagination run.
    • Easy Implementation.
    • Stateless: No server-side state required. Easy load balancing.
    • No token expiration. It’s no problem to stop the pagination and resume later (e.g. after an outage).
    • Easy client implementation and evolvability of the token. The client receives the token and passes it back to the server - as a black box. This, in turn, makes later changes in the token structure easy.

Overview

Let’s start with a recap of some web API pagination approaches.

A Little Taxonomy of Web API Pagination Approaches

  • Offset-based pagination. Already discussed in the previous post. Not reliable and really slow.
  • Keyset-based pagination aka. Continuation Tokens. In general, faster and more reliable. The concrete scalability, reliability, and efficiency depend on the approach.
    • Approaches requiring a schema change.
      • Adding a dedicated integer column which is filled by a database sequence on every update and insert. Drawbacks: A schema change is required and an additional index has to be maintained.
    • Approaches requiring no schema change.

Continuation Token Approaches

In general, a continuation token points to the last element of the current page. It is passed back to the server in order to retrieve the next page. By looking at the token, the server knows where he has to continue in order to deliver the next page.

Name Format Advantages Drawbacks
T Timestamp - Dead simple implementation - Inefficient. The client is likely to see the same elements multiple times
- Unreliable. Endless loops can occur. See details
TOC Timestamp_Offset_Checksum - Less elements are send twice
- No endless loops
- Tricky implementation
- Slow/Timeouts in case of many elements with the same timestamp (> 15000 elements). See details
TI Timestamp_ID - Simple implementation
- Scalability. Constant performance even in case of a huge amount of elements with the same timestamp
- Efficiency. Highly reduced number of elements that are delivered twice to the client (only in case of updates)

The Timestamp_ID Continuation Token

The token has the format Timestamp_ID and contains only two parts:

  • Timestamp: The timestamp of the last element of the current page. It’s usually mapped to a column like modificationDate or creationDate.
  • ID: The ID (primary key) of the last element of the current page. This is necessary to distinguish between elements with the same timestamp.

The implementation boils down to a simple but smart SQL WHERE clause:

-- Given that T is the timestamp and I is the id contained in the token.
SELECT * FROM elementTable
WHERE (
  timestampColumn > T 
  OR (timestampColumn = T AND idColumn > I)
)
AND timestampColumn < now()
ORDER BY timestampColumn asc, idColumn asc;
-- The ids in the idColumn must be unique (out-of-the-box for primary keys)
-- We need an index on both columns timestampColumn and idColumn

That’s it. No fancy algorithm that has to be applied in addition to this query. Let’s consider some cases to see this approach in action.

All elements have different timestamps.

All elements have different timestamps.

If all elements have different timestamps and the token is 30_3, searching for all elements with timestamp > 30 cuts it. We correctly continue with element 4.

Multiple elements with the same timestamp 2 are overlapping two pages.

Multiple elements with the same timestamp 2 are overlapping two pages.

In case of multiple elements with the same timestamp 2, we need the ID part of the token. Otherwise, we would miss all elements with the timestamp 20 (if we would only query for timestamp > 20). But adding the clause timestamp = 20 AND id > 3 will include the elements 4 and 5 in the next page.

All elements have the same timestamp

All elements have the same timestamp

The second clause get even more relevant if all elements have the same timestamp (like 10). No element is missed.

Note that the approach can easily handle a huge number of elements with the same timestamp. We don’t run into endless loops (contrary to the simple Timestamp approach) and keep a constant performance due to a constant LIMIT (contrary to the Timestamp_Offset_Checksum approach).

An element is changed between two page requests.

An element is changed between two page requests.

Let’s assume that the timestamp of an element (like the date of the last modification) is updated (i.e. set to a higher value) during two page requests. Take a look at element 3 with the timestamp 20. Its timestamp is set to 99 after page 1 is delivered. So it moves to the very end of the element sequence. Hence, it’s delivered twice to the client - or even multiple times. The client has to deal with it. But the most important point here is that we won’t miss any element (like element 4).

What about AND timestampColumn < now()?

We haven’t mentioned the additional condition AND timestampColumn < now() yet. To illustrate its relevance, let’s assume that it would not be there. Then, assume the following actions are happening within the same timestamp (e.g. 99):

  • Element 3’s timestamp is updated (to 99)
  • The last page is requested and delivered. Element 3 is the last element in this page. The token is 99_3.
  • Now element 2 (note that 2 < 3!) is updated (the current timestamp is still 99. So that’s the new timestamp).
Missed elements at the last page if we would deliver elements with the current timestamp.

Missed elements at the last page if we would deliver elements with the current timestamp.

In this cases, the next request with the token 99_3 would not return the element 2. So we may miss elements when we are requesting the last page and changes are happening at exactly the same timestamp. Is that likely? Well, it depends on the precision of our timestamps. It’s highly recommended to use millisecond precision which makes those situations unlikely (but not impossible). But especially when it comes to legacy applications and schemas we may face second precision. Probably, we’ll miss elements.

Independent of the precision, we can fix that problem by adding AND timestampColumn < now() to the SQL query. This way, we don’t deliver the element 3 anymore (because now() returns 99). So the continuation token becomes 20_2 (instead of 99_3). Hence, the next request (usually in the next pagination run) will correctly return both updated elements 2 and 3.

Some final thoughts and implications:

  • Our client lags 1 millisecond or 1 seconds behind the reality. We may have to do more requests to fetch the same amount of elements (which highly depends on the timestamp precision and the request frequency). Decide if this is really an issue for your system. For us, it’s not. For a more detailed discussions check out the comments.
  • Again, use timestamps with millisecond precision. If you are facing columns with second precision, consider a schema migration (if possible). For instance, MySQL 5.6.4 finally supports them.
  • Implementation: Mind, that there may be slight differences between now() in our application and in our database. Timestamps may differ. This would either increase the lag or annul the whole fix. Different timestamps shouldn’t happen, but reality tells other stories. Take away: If we generate the value of now() in our application (e.g. via Instant.now() or new Date()) we also have to generate the timestamps for updating and inserting in the application layer. Don’t let the database do it (e.g. via database functions like now() or current_timestamp()).

Constraints

  • There has to be an index on both the ID and the timestamp column.
  • The IDs have to be unique.
  • We have to sort after both the timestamp and the ID in every query. This way, we get a constant order even in case of equal timestamps.
  • It can happen that we deliver an element multiple times during one pagination run (in case of timestamp changes and moving elements). The client has to get along with this; he has to be idempotent.
  • Depending on the timestamp precision, the client is lagging 1 ms or 1 s behind the reality.

Example Implementation

The implementation is really simple. Nevertheless, I created a small example service with Kotlin, HTTP4K and a MySQL-flavoured H2 database. Check out my GitHub repository ti-continuation-token. Especially, the DAO and the classes in the token folder are relevant.

Contribution

  • I really like to thank my colleague Tom Vorwerk for pointing me to this approach!
  • Again, thanks go to Justin Karneges for pointing to a shortcoming in the comment section. This finally leads to the additional AND timestampColumn < now() clause fixing this issue.