Michael Kopp About the Author

Michael is a Technical Product Manager at Compuware. Reach him at @mikopp

Pagination with Cassandra and what we can learn from it

Like everybody else, it took me a while to wrap my head around the BigTable concepts in Cassandra. The brain needs some time to accept that a column in Cassandra is really not the same as a column in our beloved RDBMS. After that I wrote my first web application and ran into a pretty typical problem: I needed to list a large number of results and page through them on my web page. And like many others I ran straight into the next wall. Only this time it was not really Cassandra's fault, and I thought I'd share what I found.

Pagination in the table oriented world

In the mind of every developer there is a simple solution for paging: you add a sequence column to the table that is monotonically increasing and use a select like the following:

select * from my_data where sq_num between 26 and 50

This would get me 25 rows. It is fast too, because I made sure the sq_num column had an index attached to it. Now on the face of it this sounds easy, but you run into problems quickly. Almost every use case requires the result to be sorted by some of the columns. In addition, the data would not be static, but would be inserted into and possibly updated all the time. Imagine you are returning a list of names, sorted by first name. The sq_num approach will not work, because you cannot re-sequence large amounts of data every time. But luckily databases have a solution for that. You can do crazy selects like the following:

select name, address
  from (
    select rownum r, name, address
      from (select name, address from person order by name)
  )
 where r > 25
   and r <= 50;

It looks crazy, but it is actually quite fast on Oracle (and I think SQL Server too), as the optimizer is tuned for it. Although all databases have similar concepts, most don't do so well in terms of performance. Often the only thing possible with acceptable performance is to limit the number of returned rows. Offset queries, as presented here, incur a severe performance overhead. With that in mind I tried to do the same for Cassandra.
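To see why the offset style gets expensive, here is a minimal sketch using SQLite from Python; the table name and sample data are made up for the demo, but the LIMIT/OFFSET shape is the generic offset query:

```python
import sqlite3

# Throwaway in-memory table with made-up sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (name TEXT, address TEXT)")
conn.executemany("INSERT INTO person VALUES (?, ?)",
                 [("name%03d" % i, "addr%d" % i) for i in range(100)])

def page_by_offset(page, size=25):
    # LIMIT/OFFSET paging: the engine still has to sort and skip
    # page * size rows before returning anything, so cost grows
    # with the page number.
    return conn.execute(
        "SELECT name, address FROM person ORDER BY name LIMIT ? OFFSET ?",
        (size, page * size)).fetchall()
```

With 100 rows, `page_by_offset(1)` returns rows 26-50; the deeper the page, the more rows the engine produces only to throw them away.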

Pagination in Cassandra

I had a very simple use case. I stored a list of Journeys on a per-tenant basis in a Column Family. The name of the Journey was the column name and the value was the actual journey. So getting the first 25 items was simple.

get_slice("key" : tenant_key,
	  "column_parent" : {"column_family" : "Journeys_by_Tenant"},
	  "predicate" :
	    { "slice_range" :
	      { "start" : "A",
	        "end" : "Z",
	        "reverse" : "false",
	        "count" : "25" } } )

But like so many, I got stuck here: how to get the next 25 items? I looked, but there was no “offset” parameter, so I consulted Doctor Google, and the first thing I found was: “Don’t do it!” But after some more reading I found the solution, and it is very elegant indeed. More so than what I was doing in my RDBMS and, best of all, it is applicable to RDBMSs!

The idea is simple: instead of using a numeric position and a counter, you simply remember the last returned column name and use it as the starting point of your next request. So if the first request returned a list of Journeys and the 25th was “Bermuda”, then the “next” button would execute the following:

get_slice("key" : tenant_key,
	  "column_parent" : {"column_family" : "Journeys_by_Tenant"},
	  "predicate" :
	    { "slice_range" :
	      { "start" : "Bermuda",
	        "end" : "Z",
	        "reverse" : "false",
	        "count" : "26" } } )

You will notice that I now retrieve 26 items. This is because start and end are inclusive, so I simply ignore the first item in the result. Sounds super, but how do you go backwards? Turns out it is simple: you use the “first” result of your last page and execute the following:

get_slice("key" : tenant_key,
	  "column_parent" : {"column_family" : "Journeys_by_Tenant"},
	  "predicate" :
	    { "slice_range" :
	      { "start" : "Bermuda",
	        "end" : "A",
	        "reverse" : "true",
	        "count" : "26" } } )

The reverse attribute tells get_slice to go backwards. What’s important is that the end of a reverse slice must be “before” the start. Done! Well, not quite. Having a “First” and “Last” button is no problem (simply use reverse starting with Z for the last page), but if, like many web pages, you want direct jumpers to the page numbers, you will have to add some ugly cruft code.
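The whole forward/backward scheme can be sketched in a few lines of Python. The `get_slice` below is only a stand-in for the Thrift call, operating on one row's sorted column names; the journey names are invented:

```python
# One row's columns, kept sorted as Cassandra's comparator would.
journeys = sorted("Journey%02d" % i for i in range(60))

def get_slice(cols, start, end, reverse=False, count=25):
    # Mimics slice_range semantics: start and end are inclusive,
    # and a reverse slice walks the sorted columns backwards.
    if not reverse:
        hits = [c for c in cols if start <= c <= end]
    else:
        hits = [c for c in reversed(cols) if end <= c <= start]
    return hits[:count]

page1 = get_slice(journeys, "A", "Z")                      # first 25
page2 = get_slice(journeys, page1[-1], "Z", count=26)[1:]  # next: drop the duplicate
back  = get_slice(journeys, page2[0], "A", reverse=True, count=26)[1:]
# reversing `back` gives page1 again, in display order
```

Note how both "next" and "previous" fetch 26 items and discard the first one, exactly as described above.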

However, you should ask yourself how useful it really is to jump to page 16. There is no contextual meaning to the 16th page. It might be better to add bookmarks like A, B, C instead of direct page numbers.

Applying this to RDBMS?

The pagination concept found in Cassandra can be applied to every RDBMS. For the first select, simply limit the number of returned rows via ROWNUM, LIMIT or similar (you can also use the JDBC API):

select name, address
    from person
    order by name
    fetch first 25 rows only;

For the “next” call, we can apply what we learned from Cassandra:

select name, address
    from person
    where name > 'Andreas'
    order by name
    fetch first 25 rows only;

If we want to apply this to the “previous” button it will look like this:

select name, address
    from person
    where name < 'Michael'
    order by name desc
    fetch first 25 rows only;

For the “Last” button, simply omit the where clause. The advantage? It is far more portable than “offset selects” – virtually every database will support it. It should also perform rather well, as long as you have an index on the name column (the one you sort by). Finally, there is no need for a counter column!
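Here is a runnable sketch of this keyset style against SQLite (a hypothetical table and made-up data; SQLite spells the row limit LIMIT rather than FETCH FIRST, but the WHERE/ORDER BY shape is exactly the one above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (name TEXT PRIMARY KEY, address TEXT)")
conn.executemany("INSERT INTO person VALUES (?, ?)",
                 [("name%03d" % i, "addr%d" % i) for i in range(100)])

def first_page(size=25):
    return [r[0] for r in conn.execute(
        "SELECT name FROM person ORDER BY name LIMIT ?", (size,))]

def next_page(last_name, size=25):
    # The index on name seeks straight to the cursor position;
    # no rows are produced and discarded as with OFFSET.
    return [r[0] for r in conn.execute(
        "SELECT name FROM person WHERE name > ? ORDER BY name LIMIT ?",
        (last_name, size))]

def prev_page(first_name, size=25):
    rows = [r[0] for r in conn.execute(
        "SELECT name FROM person WHERE name < ? ORDER BY name DESC LIMIT ?",
        (first_name, size))]
    return rows[::-1]  # flip back into ascending display order
```

The application only ever remembers the first and last name of the current page; the database does the rest.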

Conclusion

NoSQL databases challenge us because they require some rewiring of our RDBMS-trained brains. However, some of the things we learn can also make our RDBMS applications better.

Of course you can always do even better and build pagination into your API. Amazon's SimpleDB does that, but more on SimpleDB later. Stay tuned…

Comments

  1. Peter lin says:

    That isn’t always going to work. If some data was inserted between the time the user views page 1 and page 2, the starting point could be wrong. In many cases, that’s probably fine. If someone is using Cassandra to store emails, then it won’t work. Take for example a web email system. If the user loads page 1 at T1 and loads page 2 at T2, several minutes could have elapsed. In that time, several messages could have arrived. Having said that, the emails for one user should be stored in multiple columns in a single row, so the problem of pagination is a bit more complicated.

  2. If I understood your example correctly, then the “new” emails should actually come before page 1, so if the user goes from 1 to 2 he would expect older mails; that is fine. The question is rather that page 1 itself would be different, so I would refresh it and then go on.
    In case of emails I would sort by time most of the time.

    But you are of course right, it’s not a silver bullet; nothing ever is…

  3. Doesn’t that only work if the currently sorted column only contains unique values? What if you were to, say, paginate through a set of blog posts sorted by author?

  4. For Cassandra we are talking about the name of the column, which obviously is unique within the row.

    But yes, the sorted value should be unique (it doesn’t have to be if you add cruft code). Or you could combine author and blog date (or whatever you like), and then the concept applies, although you would have to adjust the details.

  5. Just remember, at least for RDBMSs this will do at best an index scan, at worst a table scan, when doing your where name > ‘someval’. Not optimal.

  6. Every “range” select will do an index scan at best and a table scan at worst. The only way to avoid this is to use a counter column with a between (even that might lead to an index scan), which is not feasible in most cases. As the index on the name should be a sorted index, the range scan is cheap, especially if it is a limit select.

  7. This approach is not possible on a public website caring about SEO. It creates tons of useless pages/URLs.

    Also, it works by changing the spec to match what the system can do. Acceptable, but never preferable.

  8. @Tobi,

    As per our discussion, your problem with this approach is that the URLs would change whenever new users (or new data items) pop up in the list, e.g. page two could mutate from get /users?startAt=tobi to get /users?startAt=tobias, which would mean that the search engine index would be constantly outdated.

    I understand the problem, however:
    - I would not build a page where the next button results in a URL get, but rather an ajax request that adds content to the active page. That is how all pages with a lot of content do it (dzone, twitter…). So the URL itself would not change, just the content. Search engines like Google can handle this for indexing the results. The search engine could not open the “portion” of the result directly.
    - I guess I normally don’t really care whether the “list” page can be found by the search engine, but rather the item behind it. In the case of an ecommerce site I care about the fact that the SE finds my products, right? This would definitely work. If you care about the “list” itself, then the described approach has some deficiencies.
    - I would allow ?startAt (or other options like category) for bookmarking purposes.
    - If I care about finding the list pages themselves, it does not matter whether the URL is ?page=123 or ?page=A, and the content can change either way! As said, I would add “buttons” to the list to go to A or B or C, but not 1, 2, 3, and those A, B, C would result in a new URL (just an example).

    Finally, the described approach might not be usable for everything, of course. I was more concerned about how to do pagination for Cassandra and just pointing out that it has some merits for RDBMSs.

    If somebody has a better way to do (data) pagination in Cassandra (without extra self built index/counter tables) feel free to drop me a mail.

  9. The two problems are: a) reachability of detail pages; b) too many list pages in the index. As long as both are solved, there are no problems. However, an Ajax solution prevents people from opening multiple pages in multiple tabs. And if they hit back from a detail page, the current page might be reset. Going against REST principles has technical and usability disadvantages.

    I would advise the following: If data volume is not an issue, go with standard index-based paging. Only for performance reasons go with startAt.

  10. Mahendra M says:

    Hmmm, this will not work in a case where there are more than 25 entries with the same key. You will end up going in an infinite loop.

    CouchDB solves this by mandating a ‘startkey_docid’ parameter…

  11. Quite the reverse is the case. This works for thousands of entries inside one key. The start key changes for every next call; it is the position.

  12. @Mahendra, I think I got what you meant now. You thought that the same ‘name’ could exist multiple times. In Cassandra you get slices of columns per key. The column name is what is compared against the start and end of the slice, and the column name is unique within a key.

  13. Mike Burroughs says:

    @Mahendra To deal with this situation you would use a composite key. The first element of the key would be the name, the second element would be some value that differentiated each of the different names. I typically use either the key for the entity represented by the key if there is one, or a TimeUUID if not. You could also use an increasing integer value or anything else that would be guaranteed unique amongst the items in your database.
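    The composite-key idea from the comment above can be sketched with plain tuples; the authors and tiebreaker strings below are invented (in Cassandra the second element would typically be a TimeUUID):

```python
# 60 made-up posts: 30 by "alice", 30 by "bob"; each carries a unique
# tiebreaker so the (author, tiebreaker) pair is totally ordered even
# though the author alone is not.
posts = sorted(("alice" if i < 30 else "bob", "tie-%012d" % i)
               for i in range(60))

def next_page(cursor, size=25):
    # Tuple comparison is lexicographic, so `> cursor` resumes right
    # after the exact last row shown, even when many rows share the
    # same author; no infinite loop on duplicate sort values.
    hits = posts if cursor is None else [p for p in posts if p > cursor]
    return hits[:size]
```

    The cursor handed back to the client is the full pair, not just the author, which is what makes resuming unambiguous.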

  14. I need pagination for a Vaadin-based UI (GWT-based); this UI has a natural limitation to Integer.MAX_VALUE, although I expect “Long” instead… and the “Table” UI widget can skip several pages without loading content (if we “scroll” fast enough), so we will need page 15 after looking at page 7.

    Right now, I initialize an internal array of RowKeys, limited to 10,000, and it takes 8-12 seconds, which is acceptable; then I can paginate the container and it is super fast.

    So that, I am using small in-memory index, row_number -> RowKey, and paginate over row_number to find RowKey, and retrieve records using RowKey.

    A RowKey is only 16 bytes (it is an MD5 digest); for each “container” instance I need a 160 KB byte array and 8-12 seconds to initialize it.

    How to improve this? Perhaps I can use a disk-based local index! I can serialize the byte array to local storage; I could even use EHCache (but that would be over-engineering).

    Technically, I only need to count all records; then I can use an in-memory index for each 10,000-record set, and I can persist it locally for better performance.

    P.S.
    Indeed, why paginate over billions of records? Isn’t it just crazy?!

    Instead, we can introduce filters (based on Cassandra’s native indexes) and paginate over smaller sets of records. In my scenario, an in-memory index of 10,000 takes 8-12 seconds to initialize; but I am just a beginner; we could use multithreading, sharding, etc.

    We can even use a native index for shards: let the first 1000 records have a column “shard” with value “1”, the next 1000 value “2”, etc.

    Thanks

  15. To recall, my naive algo works just fine with up to 10,000 records; it uses an in-memory index (RowNum -> RowKey) warmed up on the first query.
    IMPROVEMENT: why can’t we keep only every 100th/1000th token in the in-memory index? Then we could support lists of up to 10,000,000!
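    That sampling idea from the comment above might look like this; the key format and stride are made up, and in a real system the forward walk would be a short range query against the store rather than a list lookup:

```python
STRIDE = 100
# Stands in for the full set of row keys in the store.
all_keys = ["key%07d" % i for i in range(10_000)]
# Keep only every 100th key in memory: 100 entries instead of 10,000.
sampled = all_keys[::STRIDE]

def seek(row_number):
    # Jump to the nearest sampled key, then walk the small remainder;
    # the walk is at most STRIDE - 1 steps.
    anchor = sampled[row_number // STRIDE]
    return all_keys[all_keys.index(anchor) + row_number % STRIDE]
```

    Memory shrinks by the stride factor while any row number stays reachable in a bounded number of extra reads.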
