5 Strategies for Random Records from DB

Benchmarking different approaches to fetch random database records


TLDR: I’m now using Strategy #5

If you’ve ever needed to collect a random record from a database, whether using Rails or another framework/language, you may have found that there are several strategies to choose from. In this post, we’ll explore the top 5 strategies for collecting random records, along with some tips and caveats to keep in mind.

Why Get Random Records?

Before we dive into the strategies, it’s worth taking a moment to consider why someone might need to collect a random record from a database. Generally speaking, random behavior is rarely a desired feature in an application, as it leads to non-deterministic results. However, it may be needed for business reasons.

One common use case is displaying random content on a website or app, such as a “random image” feature, or a “random author” on a book-tracking website. Another use case is testing or debugging, where it may be useful to select a random record to check for edge cases or unexpected behavior.


Strategy #1 - Use RANDOM()

The first strategy is to use the database’s built-in RANDOM function to select a random record. For PostgreSQL or SQLite, use RANDOM(), and for MySQL or MariaDB, use RAND(). While this strategy is the most robust, it can be slow as the table grows, taking around 260ms on a table with 1,000,000 records in my sample data.

Pros:

- Simple one-liner
- Robust: works regardless of gaps in IDs or ID type

Cons:

- Slow on large tables (~260ms at 1,000,000 records)

Author.order('RANDOM()').limit(1)
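
On MySQL or MariaDB, the same idea uses RAND(). A minimal sketch, assuming a recent Rails version, which expects raw SQL fragments in order to be wrapped in Arel.sql:

# RAND() is the MySQL/MariaDB spelling; Arel.sql marks the
# string as intentionally raw SQL
Author.order(Arel.sql('RAND()')).limit(1)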

Strategy #2 - Pluck & Array.sample

Another strategy is to pluck ALL of the IDs from the database and use Ruby’s Array#sample method to pick one of them at random. While this strategy is faster than strategy #1, it can still get slow as the table grows, taking around 50ms on a table with 1,000,000 records. Also consider the amount of memory such a strategy might consume: every ID gets loaded into a Ruby array.

Pros:

- Faster than strategy #1 (~50ms at 1,000,000 records)
- Handles gaps from deleted records

Cons:

- Loads every ID into memory, so memory use grows with the table

Author.find(Author.pluck(:id).sample)
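
Since all IDs are already in memory, this approach also extends naturally to picking several distinct records at once. A quick sketch:

# Array#sample(5) returns 5 distinct elements, so this fetches
# 5 different random authors in a single query
Author.where(id: Author.pluck(:id).sample(5))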

Strategy #3 - find_by with Min/Max ID

A third strategy is to generate a random ID between the minimum and maximum IDs in the table (here approximated as rand(1..Author.count), which assumes IDs start at 1 and have few gaps) and then use find_by to look up the record. This strategy is extremely fast, taking around 17ms on a table with 1,000,000 records, but it can be brittle if there are deleted records.

Pros:

- Extremely fast (~17ms at 1,000,000 records)

Cons:

- Brittle: returns nil when the generated ID belongs to a deleted record
- Assumes sequential integer IDs

Author.find_by(id: rand(1..Author.count))
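
Because of that brittleness, a common pattern is to retry a few times when the generated ID hits a gap. A minimal sketch; random_author and max_attempts are illustrative names, not part of the original code:

# Retry until we land on an ID that still exists, capped so a very
# sparse table can't loop forever
def random_author(max_attempts: 5)
  max_attempts.times do
    author = Author.find_by(id: rand(1..Author.count))
    return author if author
  end
  nil # give up (or fall back to a slower strategy, e.g. a random offset)
end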

Strategy #4 - Random Offset

A fourth strategy is to use a random offset to look up a record. This strategy is not as fast as the ID-based lookups, taking around 120ms on a table with 1,000,000 records, but it doesn’t require error handling and works with any type of ID.

Pros:

- No error handling required
- Works with any type of ID

Cons:

- Slower than the ID-based lookups (~120ms at 1,000,000 records)

# rand(Author.count) yields an offset in 0..count-1, so the query
# can never run past the end of the table
Author.offset(rand(Author.count)).limit(1)
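
Because the offset never touches ID values, this strategy also composes with scopes and non-integer primary keys such as UUIDs. A sketch, where the active scope is hypothetical and assumed non-empty:

# Works even when IDs are UUIDs; `active` is a made-up scope
scope = Author.where(active: true)
scope.offset(rand(scope.count)).limit(1)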

Strategy #5 - Where with Min/Max

A fifth strategy, and my preferred strategy, is to get the minimum and maximum values of the IDs in the database, use rand(min..max) to generate a random ID, and then use a where lookup to select the record. This strategy is extremely fast, taking only 1-6ms on a table with 1,000,000 records, and it’s also robust, handling deleted records without issue.

Pros:

- Extremely fast (1-6ms at 1,000,000 records)
- Handles gaps from deleted records without issue

Cons:

- Requires numeric, ordered IDs

# `random_id` can be memoized / cached,
# but beware of scoping before caching(!!!)
# as you can get empty results where many are expected
random_id = rand(Author.minimum(:id)..Author.maximum(:id))
# order(:id) makes "the first record at or after random_id" explicit;
# without it the database is free to return any matching row
Author.where("id >= ?", random_id).order(:id).limit(1)
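
To make the scoping warning above concrete, here is a sketch of how caching can bite; the published scope is hypothetical:

# BAD: random_id was computed over ALL authors. If every published
# author happens to sit above random_id's range start but below
# random_id itself, this returns nothing even though plenty of
# published authors exist.
random_id = rand(Author.minimum(:id)..Author.maximum(:id))
Author.published.where("id >= ?", random_id).order(:id).limit(1)

# BETTER: derive min/max from the same scope you query
random_id = rand(Author.published.minimum(:id)..Author.published.maximum(:id))
Author.published.where("id >= ?", random_id).order(:id).limit(1)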

Out of these strategies, my preferred one used to be #3. Even if the table has some deleted records, it is still faster on average for my use case. Deletes from that table are also rare, and we use soft deletes, which my error handling can rely on. I sometimes used strategy #4, as it also works well.

However, since I first used this code, I answered a StackOverflow question (which led to this blog post) and came up with strategy #5, which is now my preferred option.


Tips and Caveats

A couple of tips to keep in mind when implementing these strategies:

Caching Author.count:

If the actual current count is not critical, you can preload Author.count and memoize it, for example as Author.total_count, or keep it in a config value loaded on app restart (if you restart often enough for it to stay reasonably fresh). This removes the extra time needed to get Author.count and speeds up option 3 to under 5ms and option 4 to about 100ms. Here’s an example of how to do this:

# Memoized at the class level; the value goes stale until the
# app restarts (or @total_count is reset)
def self.total_count
  @total_count ||= count
end
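
With that in place, strategy #3’s lookup becomes:

# Same lookup as before, minus the extra COUNT query
Author.find_by(id: rand(1..Author.total_count))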

Be Careful with Optimizations:

However, I would warn against going overboard with optimizations here, such as caching min/max values: depending on the context, scoping, etc., it may lead to unexpected results. If you use strategy #5, the bottleneck is not going to be the database, so making 3 quick queries is preferable, unless it really, really matters for your WebScale™ app.

Error Handling:

When it comes to error handling, you have a couple of options. You can retry lookups until you find a record that exists, or you can generate several candidate IDs with #3 and select a random one from the result set, as sketched below. Depending on how many gaps you have in your data set, strategy #3 may not be the best choice.
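
Here is a rough sketch of that batch variant: generate several candidate IDs up front, fetch whichever ones survived deletion, and sample from the result. The candidate count of 10 is arbitrary:

# The more gaps in your IDs, the more candidates you may want
total = Author.count
candidate_ids = Array.new(10) { rand(1..total) }
Author.where(id: candidate_ids).to_a.sample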


Your Thoughts?

Did I miss any approaches, or make a glaring blunder? Please comment and let me know :)


Originally published: dev.to/konung/5-strategies-for-random-records-from-db-5ed5

Cover Art: MidJourney AI for “random records from database” prompt


Last updated: February 20, 2023

ruby rails postgresql mysql database performance