Several blog posts and GitHub
repos try to benchmark packages for dataframe manipulation
(data.table, dplyr, duckplyr,
etc.). When it comes to benchmarking tidypolars, I have
seen several implementation mistakes that make the results quite
unreliable. This is not to say that tidypolars will always
be faster than alternatives. In fact, there is rarely one tool that is
better than all the others in all aspects of dataframe manipulation. The
goal of this vignette is to give some advice on how best to benchmark
tidypolars.
Summary: instead of this code:
my_function <- function(dat) {
  my_polars_data <- as_polars_df(dat)
  my_polars_data |>
    <some slow operation>
}

bench::mark(
  my_function(my_r_data_frame)
)

use this code:
my_polars_data <- as_polars_lf(my_r_data_frame)

my_function <- function(dat) {
  dat |>
    <some slow operation> |>
    compute() # or compute(engine = "streaming")
}

bench::mark(
  my_function(my_polars_data)
)

Don't include as_polars_df() or as_polars_lf() in the timing

Some benchmarks do something like the following:
my_function <- function(dat) {
  dat <- as_polars_df(dat)
  dat |>
    <some slow operation>
}

bench::mark(
  my_function(my_r_data_frame)
)

The issue with this approach is that as_polars_df()
converts the R data.frame to a Polars DataFrame, which
takes some time. As highlighted
in the “Getting started” vignette, as_polars_df() and
as_polars_lf() are convenience functions for demo and
testing purposes. The recommended way to use tidypolars is with the dedicated readers (scan_parquet_polars(), read_parquet_polars(), etc.), which import the data directly as a Polars DataFrame or LazyFrame.
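For instance, assuming the data lives in a Parquet file (the path below is just a placeholder):

library(tidypolars)

# Import directly as a LazyFrame, without creating an R data.frame first
my_lazy_data <- scan_parquet_polars("path/to/data.parquet")

# Or as a Polars DataFrame
my_data <- read_parquet_polars("path/to/data.parquet")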
Using as_polars_df() and as_polars_lf() is
fine to get the data ready to be benchmarked, but those operations
should not be included in the timing.
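If you want to see how much the conversion itself costs, one option is to benchmark both versions side by side. This is only a sketch: the data, the slow_op() helper, and the column names are made up for illustration.

library(tidypolars)
library(dplyr, warn.conflicts = FALSE)

dat_r  <- data.frame(x = runif(1e6), y = sample(letters, 1e6, replace = TRUE))
dat_pl <- as_polars_df(dat_r) # conversion done once, outside the timing

slow_op <- function(d) {
  d |>
    filter(x > 0.5) |>
    mutate(z = x * 2)
}

bench::mark(
  conversion_included = slow_op(as_polars_df(dat_r)),
  conversion_excluded = slow_op(dat_pl),
  check = FALSE
)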
Polars provides DataFrames and LazyFrames. Operations on DataFrames are executed in “eager mode”, meaning that there is no optimization happening behind the scenes, for instance to efficiently reorder operations. In real-life workflows, it is strongly recommended to use LazyFrames since they allow for a large number of optimizations.
In benchmarks or demos, you should therefore prefer
as_polars_lf() over as_polars_df(). Using
LazyFrames means that you also need to collect results to trigger
computation. This can be done using compute() (to return a
Polars DataFrame), collect() (to return an R
data.frame), or as_tibble() (to return a
tibble).
Depending on the objective of the benchmark, each of the three
options can be valid. compute() is faster because it avoids
the extra step of converting a Polars DataFrame to R. However, the
output of compute() cannot be passed directly to
ggplot2, for example.
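As a small sketch of the three options (using as_polars_lf() on a built-in dataset purely for illustration; in real code, prefer the dedicated readers):

library(tidypolars)
library(dplyr, warn.conflicts = FALSE)

lf <- as_polars_lf(mtcars)

# Nothing is computed yet: this only builds the query
query <- lf |>
  filter(cyl >= 6)

query |> compute()   # triggers computation, returns a Polars DataFrame
query |> collect()   # triggers computation, returns an R data.frame
query |> as_tibble() # triggers computation, returns a tibble (e.g. for ggplot2)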
In summary, instead of doing:
my_polars_data <- as_polars_df(my_r_data_frame)

my_function <- function(dat) {
  dat |>
    <some slow operation>
}

bench::mark(
  my_function(my_polars_data)
)

you should do:

my_polars_data <- as_polars_lf(my_r_data_frame)

my_function <- function(dat) {
  dat |>
    <some slow operation> |>
    compute()
}

bench::mark(
  my_function(my_polars_data)
)
In addition to lazy execution, Polars comes with a streaming engine that is able to perform operations on larger-than-RAM data. It is also faster than the default engine in many cases, not just with larger-than-RAM data.
Note: at the time of writing (August 2025), the streaming engine is not yet used by default, but it should become the default in the next few months.
Using it is extremely simple: pass engine = "streaming" to compute(), collect(), or as_tibble().
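For example, reusing the query object from the sketch above:

query |> compute(engine = "streaming") # Polars DataFrame, computed with the streaming engine
query |> collect(engine = "streaming") # R data.frame, computed with the streaming engine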
Some benchmarks take advantage of tools such as future or mirai to run code in parallel. However, tidypolars shouldn't be used with such tools: the polars code used internally is already optimized to run on all available cores, and it isn't guaranteed to play well with other parallelism frameworks.
In some cases, you may want to use those frameworks on Polars objects, but this should be the exception rather than the rule.