• by chrisjc on 10/27/2021, 8:42:10 AM

    I'm a little confused with Arrow being included in the comparison.

    What does Arrow have to do with Parquet? We are talking about the file format Parquet, right? Does Arrow use Parquet as its default data storage format?

    But isn't Arrow a format too? As I understand it, Arrow is a format optimized for transferring in-memory data from one distributed system to another (ser-des), while also facilitating and optimizing certain set operations. From RAM in one system to RAM in another.

    Moreover, since Arrow is a format, why is it being compared to databases like SQLite and DuckDB? If we're talking about formats, why not compare Arrow queries against Parquet data to DuckDB queries against Parquet data? https://duckdb.org/docs/data/parquet

    Why not at least benchmark the query execution alone instead of startup and loading of data? For Arrow, isn't it assumed that there is an engine like Spark or Snowflake already up and running that's serving you data in the Arrow format? Ideally, with Arrow you should never be dealing with data starting in a resting format like Parquet. The data should already be in RAM to reap the benefits of Arrow. Its value proposition is it'll get "live" data from point A to B as efficiently as possible, in an open, non-proprietary, ubiquitous (eventually) format.

    Exactly what of SQLite, DuckDB and Arrow is being compared here?

    I would assume the benefits of Arrow in R (or DataFrames in general) would be getting data from a data engine into your DataFrame runtime as efficiently as possible. (just as interesting might be where and how push-downs are handled)

    Perhaps I'm missing the trees for the forest?

    No disrespect to the author... Seems like they're on a quest for knowledge, and while the article is confusing to me, it certainly got me thinking.

    Disclaimer: I don't read R too good, and I'm still struggling with what exactly Arrow is. (Comparisons like this actually leave me even more confused about what Arrow is)

  • by isoprophlex on 10/27/2021, 7:00:04 AM

    The relevant results from the linked article:

        ##    format          median_time  mem_alloc
        ## 1  R (RDS)               1.34m     4.08GB
        ## 2  SQL (SQLite)          5.48s     6.17MB
        ## 3  SQL (DuckDB)          1.76s   104.66KB
        ## 4  Arrow (Parquet)       1.36s   453.89MB
    
    I'd bet that doing the same with Pandas would require time and space similar to RDS (1). I really hope DuckDB makes it in the Python world; everything I read about it seems very promising. Using it myself for toy projects was pleasant, too.

  • by kristjansson on 10/27/2021, 5:04:39 PM

    Lots of confusion derives from the best-of-breed parquet readers for Python and R residing in the Arrow packages, mostly because Arrow is (and does) a lot of things.

    There's:

    * arrow, an in-memory format for dataframe-shaped things supporting fast computation, zero-copy sharing, etc.

    * arrow Feather V1, an on-disk format for dataframe-shaped things

    * arrow IPC, a (de)serialization format for arrow buffers and a protocol for sending them to / receiving them from other processes.

    * arrow Feather V2, an on-disk format that's basically the IPC serialization written to file[1]

    * arrow Flight, a protocol for requesting/sending/receiving data to/from remote machines that's basically gRPC layered over the IPC format

    * arrow DataFusion/Ballista, nascent system(s) for local/distributed query execution over arrow buffers

    * other subprojects I'm surely forgetting

    * a (very good) C++ Parquet reader [2]/[3] developed under the auspices of the project

    * libraries in / bindings to many languages, including R and Python, supporting interaction with (subsets of) the above.

    It's only the last piece that's exposed to most data-science-y users, and thus identified with the 'arrow' name. Since those libraries are also very good and hide their abstractions well, those users are free to use the functionality relevant to them, be it dealing with parquet, feather, etc., without needing to understand how they work.

    Not that this is a criticism of the project, or those users! Arrow encompasses lots of different functionality, which enables it to provide different things to different people. As a result, though, 'Arrow' connotes lots of different things (and different _sorts_ of things) to different users, which can cause some confusion if terms aren't fully specified, or are even a bit misunderstood.
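
    To make that concrete, here is roughly what most R users actually touch, all via the one arrow package (file names made up):

        library(arrow)

        # "Parquet support" = the project's C++ Parquet reader, wrapped for R
        df <- read_parquet("trades.parquet")

        # "Feather" = Feather V2, i.e. the IPC serialization written to disk
        write_feather(df, "trades.feather")
        df2 <- read_feather("trades.feather")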

    [1] https://stackoverflow.com/a/67911190/881025 [2] https://github.com/apache/parquet-cpp [3] https://github.com/apache/arrow/tree/master/cpp

  • by dgudkov on 10/27/2021, 9:23:49 AM

    DuckDB is faster than SQLite on an analytical type of query. That's expected because DuckDB is column-based and designed exactly for this type of query.

    The analysis would be more telling if it measured insert performance as well.
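
    A rough sketch of how that could be added, assuming the article's data frame is called yrpc:

        library(DBI)

        con_sqlite <- dbConnect(RSQLite::SQLite(), "insert_test.sqlite")
        con_duck   <- dbConnect(duckdb::duckdb(), dbdir = "insert_test.duckdb")

        # Row-oriented vs column-oriented engines can differ a lot here
        system.time(dbWriteTable(con_sqlite, "yrpc", yrpc))
        system.time(dbWriteTable(con_duck, "yrpc", yrpc))

        dbDisconnect(con_sqlite)
        dbDisconnect(con_duck, shutdown = TRUE)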

  • by cosmaioan on 10/27/2021, 7:21:09 AM

    I would be curious to see DuckDB with Parquet: https://duckdb.org/docs/data/parquet
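
    Something along these lines should work from R; the file and column names are my guesses at the article's data:

        library(DBI)
        con <- dbConnect(duckdb::duckdb())

        dbGetQuery(con, "
          SELECT reporter_iso, SUM(trade_value_usd_exp) AS exports
          FROM parquet_scan('yrpc.parquet')
          WHERE year = 2018
          GROUP BY reporter_iso
        ")

        dbDisconnect(con, shutdown = TRUE)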

  • by ryndbfsrw on 10/27/2021, 7:56:45 AM

    I am certain I'm in the wrong here, but I'm struggling to understand Arrow's USP. I (originally) assumed it meant Python/R users would be able to get around memory limitations when model-fitting, but all the examples I've come across are just data manipulation, and none of the main modeling packages support it. For those who are using it: what am I missing?
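
    The examples I've come across look roughly like this (a sketch, paths made up): the dataset stays on disk and only collect() pulls the aggregate into RAM.

        library(arrow)
        library(dplyr)

        open_dataset("data/yrpc/") %>%          # directory of Parquet files, never fully loaded
          filter(year == 2018) %>%
          group_by(reporter_iso) %>%
          summarise(exports = sum(trade_value_usd_exp)) %>%
          collect()                             # only the aggregate lands in R's memory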

  • by einpoklum on 10/27/2021, 7:47:31 AM

    Note that:

    * The comparison used a single, simple query: filter, group by, and sum.

    * The comparison does not include repeated multiple queries (similar or dissimilar).

    * A moderate amount of data is used.

    This is certainly relevant and important, but it's not an extensive comparison.

  • by homerowilson on 10/27/2021, 1:23:39 PM

    I enjoyed this comparison, thanks! Here is a related, generally R-centric comparison of DuckDB, dplyr, data.table, etc. applied to five data-science-y problems, which I wrote up a few months ago and you might enjoy: https://github.com/bwlewis/duckdb_and_r

  • by HighChaparral on 10/27/2021, 10:59:17 AM

    Possibly nothing more than a side issue, but why are the indexes being created before the bulk insert, rather than after?
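
    i.e. something like this (table and index names taken from the article's snippets), so each index is built once instead of being maintained row by row during the load:

        library(DBI)
        con <- dbConnect(RSQLite::SQLite(), "trade.sqlite")

        # bulk insert first...
        dbWriteTable(con, "yrpc", yrpc)

        # ...then build the indexes in one pass
        dbExecute(con, "CREATE INDEX year ON yrpc (year)")
        dbExecute(con, "CREATE INDEX reporter_iso ON yrpc (reporter_iso)")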

  • by knome on 10/27/2021, 12:51:00 PM

    >dbSendQuery(con, "CREATE INDEX year ON yrpc (year)")

    >dbSendQuery(con, "CREATE INDEX reporter_iso ON yrpc (reporter_iso)")

    Would SQLite use these for the given query?

    It seems they wouldn't help with grouping, and the grouping seems antagonistic to filtering. If you're trying for performance, you can avoid having the query engine refer back to the original table during processing by using a covering index that includes the data fields in addition to the query fields:

    "CREATE INDEX covering (reporter_iso, year, trade_value_usd_exp, trade_value_usd_imp)"

    or maybe even something like

    "CREATE INDEX covering (reporter_iso, year, reporter_iso, trade_value_usd_exp, trade_value_usd_imp)"

    Though I'm not sure how its query planner would do with a duplicate column in the index.
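
    One way to check rather than guess, using the article's SQLite connection (the query shape here is just my stab at the article's):

        dbGetQuery(con, "
          EXPLAIN QUERY PLAN
          SELECT reporter_iso, year,
                 SUM(trade_value_usd_exp), SUM(trade_value_usd_imp)
          FROM yrpc
          -- plus whatever WHERE clause the article uses
          GROUP BY reporter_iso, year
        ")
        # the detail column says whether it SEARCHes an index or SCANs the table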

  • by CRConrad on 10/27/2021, 1:19:55 PM

    Here's another HN page on an article comparing SQLite to other DBs, from just the other day :-) https://news.ycombinator.com/item?id=7432619

  • by mjburgess on 10/27/2021, 7:09:04 AM

    Does the R code for, e.g., SQLite actually build an SQL query?

    This just looks like a tidyverse library comparison. I'd expect a benchmark using each backend's own libraries, rather than assuming tidyverse will have an optimal way of querying them.
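
    dbplyr at least makes this easy to check; a sketch, assuming the article's table is called yrpc:

        library(dplyr)

        tbl(con, "yrpc") %>%
          filter(year == 2018) %>%
          group_by(reporter_iso) %>%
          summarise(exports = sum(trade_value_usd_exp)) %>%
          show_query()
        # prints the SELECT ... GROUP BY ... that gets sent to SQLite/DuckDB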

  • by marcle on 10/27/2021, 6:21:03 AM

    TLDR: Arrow and DuckDB provide fast database aggregates compared with R's RDS format and, to an extent, SQLite.

    It is unclear how much functionality is available for Arrow under R: any comments? It would also be interesting to see a similar benchmark for Python, which could include the embedded version of MonetDB -- an R package for MonetDB/e is not yet available.

    Edit: amended the TLDR to reflect jhoechtl's and wodenokoto's comments. SQLite provided reasonably memory efficient aggregates.

  • by flakiness on 10/27/2021, 12:05:40 PM

    Although this article focuses on dplyr, which provides a uniform API over various data sources, I guess the choice depends more on personal taste: do you like staying in R/Python, or in SQL?

    If you prefer SQL, using raw Arrow or in-memory data doesn't make much sense, except that DuckDB does support SQL queries over these non-database data sources as well.
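
    For instance, duckdb_register() exposes an ordinary R data frame as a table you can query with SQL (a small sketch):

        library(DBI)
        con <- dbConnect(duckdb::duckdb())

        duckdb::duckdb_register(con, "cars", mtcars)
        dbGetQuery(con, "SELECT cyl, AVG(mpg) AS avg_mpg FROM cars GROUP BY cyl")

        dbDisconnect(con, shutdown = TRUE)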

    Anyway, the R data ecosystem is very rich for both R and SQL fans, and as a Python user I feel a bit envious.

  • by vertere on 10/27/2021, 12:37:27 PM

    I'm a bit surprised the DuckDB file size is that large (5x Parquet, and nearly as large as SQLite). I haven't used DuckDB, but I thought it had columnar storage and so would have better compression.

  • by xiaodai on 10/27/2021, 9:14:48 AM

    I think disk.frame should fare pretty well if you use the `srckeep` function.

  • by Fiahil on 10/27/2021, 9:46:37 AM

    So DuckDB is a kind of DataFusion (https://arrow.apache.org/datafusion/) ?

  • by WFHRenaissance on 10/27/2021, 1:19:49 PM

    Why aren't indexes being created on the RDS instance? I know it's out of scope based on the title, but also it's used for comparison in the final results table.

  • by m0zg on 10/27/2021, 8:46:00 AM

    Might as well try Arrow Feather; there's an R binding for it. This Flatbuffers-based format can be memory-mapped and requires very little decoding. Super fast and light on resources.
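
    A minimal sketch with the arrow R package (file name made up):

        library(arrow)

        write_feather(mtcars, "cars.feather")   # Feather V2, i.e. Arrow IPC on disk
        df <- read_feather("cars.feather")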

    Also, Parquet supports different encodings, some of which may be faster, denser, or lighter on RAM, or all of the above.