5 Strategies for Random Records from DB

Benchmarking different approaches to fetch random database records


TLDR: I’m now using Strategy #5

If you’ve ever needed to collect a random record from a database, whether using Rails or another framework/language, you may have found that there are several strategies to choose from. In this post, we’ll explore the top 5 strategies for collecting random records, along with some tips and caveats to keep in mind.

Why Get Random Records?

Before we dive into the strategies, it’s worth taking a moment to consider why someone might need to collect a random record from a database. Generally speaking, random behavior is rarely a desired feature in an application, as it leads to non-deterministic results. However, it may be needed for business reasons.

One common use case is displaying random content on a website or app, such as a “random image” feature, or a “random author” on a book-tracking website. Another use case is testing or debugging, where it may be useful to select a random record to check for edge cases or unexpected behavior.


Strategy #1 - Use RANDOM()

The first strategy is to use the database’s built-in RANDOM function to select a random record. For PostgreSQL or SQLite, use RANDOM(), and for MySQL or MariaDB, use RAND(). While this strategy is the most robust, it can be slow as the table grows, taking around 260ms on a table with 1,000,000 records in my sample data.

Pros:

- Simple one-liner
- Robust: works regardless of gaps in IDs or ID type

Cons:

- Slow on large tables (~260ms at 1,000,000 records)

Author.order('RANDOM()').limit(1)
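
On MySQL or MariaDB, the same idea uses RAND(). A minimal sketch, assuming a recent Rails version, which expects raw SQL fragments in order to be wrapped in Arel.sql:

# RAND() is the MySQL/MariaDB spelling; Arel.sql marks the
# string as intentionally raw SQL
Author.order(Arel.sql('RAND()')).limit(1)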

Strategy #2 - Pluck & Array.sample

Another strategy is to pluck ALL of the IDs from the database and use Ruby’s Array#sample method to pick one of them at random. While this strategy is faster than strategy #1, it can still get slow as the table grows, taking around 50ms on a table with 1,000,000 records. Also consider the amount of memory such a strategy might consume: every ID gets loaded into a Ruby array.

Pros:

- Faster than strategy #1 (~50ms at 1,000,000 records)
- Handles gaps from deleted records

Cons:

- Loads every ID into memory, so memory use grows with the table

Author.find(Author.pluck(:id).sample)
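
Since all IDs are already in memory, this approach also extends naturally to picking several distinct records at once. A quick sketch:

# Array#sample(5) returns 5 distinct elements, so this fetches
# 5 different random authors in a single query
Author.where(id: Author.pluck(:id).sample(5))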

Strategy #3 - find_by with Min/Max ID

A third strategy is to generate a random ID between the minimum and maximum IDs in the table (here approximated as rand(1..Author.count), which assumes IDs start at 1 and have few gaps) and then use find_by to look up the record. This strategy is extremely fast, taking around 17ms on a table with 1,000,000 records, but it can be brittle if there are deleted records.

Pros:

- Extremely fast (~17ms at 1,000,000 records)

Cons:

- Brittle: returns nil when the generated ID belongs to a deleted record
- Assumes sequential integer IDs

Author.find_by(id: rand(1..Author.count))
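
Because of that brittleness, a common pattern is to retry a few times when the generated ID hits a gap. A minimal sketch; random_author and max_attempts are illustrative names, not part of the original code:

# Retry until we land on an ID that still exists, capped so a very
# sparse table can't loop forever
def random_author(max_attempts: 5)
  max_attempts.times do
    author = Author.find_by(id: rand(1..Author.count))
    return author if author
  end
  nil # give up (or fall back to a slower strategy, e.g. a random offset)
end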

Strategy #4 - Random Offset

A fourth strategy is to use a random offset to look up a record. This strategy is not as fast as the ID-based lookups, taking around 120ms on a table with 1,000,000 records, but it doesn’t require error handling and works with any type of ID.

Pros:

- No error handling required
- Works with any type of ID

Cons:

- Slower than the ID-based lookups (~120ms at 1,000,000 records)

# rand(Author.count) yields an offset in 0..count-1, so the query
# can never run past the end of the table
Author.offset(rand(Author.count)).limit(1)
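
Because the offset never touches ID values, this strategy also composes with scopes and non-integer primary keys such as UUIDs. A sketch, where the active scope is hypothetical and assumed non-empty:

# Works even when IDs are UUIDs; `active` is a made-up scope
scope = Author.where(active: true)
scope.offset(rand(scope.count)).limit(1)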

Strategy #5 - Where with Min/Max

A fifth strategy, and my preferred strategy, is to get the minimum and maximum values of the IDs in the database, use rand(min..max) to generate a random ID, and then use a where lookup to select the record. This strategy is extremely fast, taking only 1-6ms on a table with 1,000,000 records, and it’s also robust, handling deleted records without issue.

Pros:

- Extremely fast (1-6ms at 1,000,000 records)
- Handles gaps from deleted records without issue

Cons:

- Requires numeric, ordered IDs

# `random_id` can be memoized / cached,
# but beware of scoping before caching(!!!)
# as you can get empty results where many are expected
random_id = rand(Author.minimum(:id)..Author.maximum(:id))
# order(:id) makes "the first record at or after random_id" explicit;
# without it the database is free to return any matching row
Author.where("id >= ?", random_id).order(:id).limit(1)
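
To make the scoping warning above concrete, here is a sketch of how caching can bite; the published scope is hypothetical:

# BAD: random_id was computed over ALL authors. If every published
# author happens to sit above random_id's range start but below
# random_id itself, this returns nothing even though plenty of
# published authors exist.
random_id = rand(Author.minimum(:id)..Author.maximum(:id))
Author.published.where("id >= ?", random_id).order(:id).limit(1)

# BETTER: derive min/max from the same scope you query
random_id = rand(Author.published.minimum(:id)..Author.published.maximum(:id))
Author.published.where("id >= ?", random_id).order(:id).limit(1)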

Out of these strategies, my preferred one used to be #3. Even if the table has some deleted records, it is still faster on average for my use case. Deletes from that table are also rare, and we use soft deletes, which my error handling can rely on. I sometimes used strategy #4, as it also works well.

However, since I first used this code, I answered a StackOverflow question (which led to this blog post) and came up with strategy #5, which is now my preferred option.


Tips and Caveats

A couple of tips to keep in mind when implementing these strategies:

Caching Author.count:

If the actual current count is not critical, you can preload Author.count and memoize it, for example as Author.total_count, or keep it in a config value loaded on app restart (if you restart often enough for it to stay reasonably fresh). This removes the extra time needed to get Author.count and speeds up option 3 to under 5ms and option 4 to about 100ms. Here’s an example of how to do this:

# Memoized at the class level; the value goes stale until the
# app restarts (or @total_count is reset)
def self.total_count
  @total_count ||= count
end
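
With that in place, strategy #3’s lookup becomes:

# Same lookup as before, minus the extra COUNT query
Author.find_by(id: rand(1..Author.total_count))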

Be Careful with Optimizations:

However, I would warn against going overboard with optimizations here, such as caching min/max values: depending on the context, scoping, etc., it may lead to unexpected results. If you use strategy #5, the bottleneck is not going to be the database, so making 3 quick queries is preferable, unless it really, really matters for your WebScale™ app.

Error Handling:

When it comes to error handling, you have a couple of options. You can retry lookups until you find a record that exists, or you can generate several candidate IDs with #3 and select a random one from the result set, as sketched below. Depending on how many gaps you have in your data set, strategy #3 may not be the best choice.
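
Here is a rough sketch of that batch variant: generate several candidate IDs up front, fetch whichever ones survived deletion, and sample from the result. The candidate count of 10 is arbitrary:

# The more gaps in your IDs, the more candidates you may want
total = Author.count
candidate_ids = Array.new(10) { rand(1..total) }
Author.where(id: candidate_ids).to_a.sample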


Your Thoughts?

Did I miss any approaches, or make a glaring blunder? Please comment and let me know :)


Originally published: dev.to/konung/5-strategies-for-random-records-from-db-5ed5

Cover Art: MidJourney AI for “random records from database” prompt


Last updated: February 20, 2023

ruby rails postgresql mysql database performance