• by MrPowers on 9/21/2021, 3:45:20 AM

    Here are the big tips I think the article missed:

    Use the new string dtype, which requires way less memory; see this video: https://youtu.be/_zoPmQ6J1aE. object columns are really memory hungry and the new dtype is a game changer.
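
    Something like this shows the difference (toy data; the Arrow-backed `string[pyarrow]` dtype needs pandas >= 1.3 with pyarrow installed):

        import pandas as pd

        words = ["foo", "bar", "baz"] * 1_000_000

        # object dtype stores a full Python string object per row
        obj_col = pd.Series(words, dtype="object")
        # the Arrow-backed string dtype keeps the data in compact Arrow buffers
        str_col = pd.Series(words, dtype="string[pyarrow]")

        print(obj_col.memory_usage(deep=True))
        print(str_col.memory_usage(deep=True))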

    Use Parquet and leverage column pruning. `usecols` (which is for `read_csv`) doesn't give you column pruning. You need a columnar file format, and you need to pass the `columns` argument to `read_parquet`. You can never truly "skip" a column in a row-based format like CSV; the whole file still has to be read and parsed. Spark's optimizer does column projections automagically; with Pandas you have to do them manually.
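
    Roughly (made-up file and column names):

        import pandas as pd

        # usecols limits the columns in the resulting DataFrame,
        # but the CSV parser still has to scan every byte of the file
        df_csv = pd.read_csv("trips.csv", usecols=["fare", "tip"])

        # with Parquet, only the requested columns are read from disk
        df_pq = pd.read_parquet("trips.parquet", columns=["fare", "tip"])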

    Use predicate pushdown filtering to limit the data that's read into the DataFrame; here's a blog post I wrote on this: https://coiled.io/blog/parquet-column-pruning-predicate-push...
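
    With the pyarrow engine you can pass `filters` to `read_parquet`, so row groups that can't match the predicate are skipped instead of being read (again, made-up names):

        import pandas as pd

        df = pd.read_parquet(
            "trips.parquet",
            columns=["fare", "tip"],
            # pushed down to the Parquet reader, so non-matching
            # row groups are never loaded into memory
            filters=[("fare", ">", 100)],
        )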

    Use a technology like Dask (each partition in a Dask DataFrame is a Pandas DataFrame) that doesn't require everything to be stored in memory and can run computations in a streaming manner.
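
    The Dask API mirrors Pandas pretty closely; a tiny sketch (hypothetical paths and columns):

        import dask.dataframe as dd

        # one Pandas DataFrame per partition, loaded lazily
        ddf = dd.read_parquet("trips/*.parquet", columns=["pickup_zone", "fare"])

        # work happens partition by partition; nothing is materialized
        # until .compute() is called
        result = ddf.groupby("pickup_zone")["fare"].mean().compute()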

  • by mint2 on 9/21/2021, 12:22:38 AM

    It’s useful to point out pitfalls of lower precision too. I don’t usually see these articles go over that.

    Operations on a few million rows of float32s can give strange results, for example when summing: `df["colA"].sum()` can be noticeably different from `df.sort_index()["colA"].sum()`. It's a trap for the unwary.
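
    A small sketch of the effect with synthetic data (the exact discrepancy depends on the values, but float32 rounding makes the sum order-dependent):

        import numpy as np
        import pandas as pd

        rng = np.random.default_rng(42)
        # the running sum quickly exceeds float32's ~7 significant digits,
        # so the rounding error depends on the order of the rows
        values = rng.normal(loc=1e6, scale=1.0, size=3_000_000).astype("float32")
        df = pd.DataFrame({"colA": values}, index=rng.permutation(len(values)))

        print(df["colA"].sum())                    # summed in stored order
        print(df.sort_index()["colA"].sum())       # summed after sorting the index
        print(df["colA"].astype("float64").sum())  # float64 for comparison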