• by MrPowers on 9/21/2021, 3:45:20 AM

    Here are the big tips I think the article missed:

    Use the new string dtype, which requires way less memory; see this video: https://youtu.be/_zoPmQ6J1aE. object columns are really memory hungry and the new dtype is a game changer.
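
    Something like this shows the difference (toy data; the Arrow-backed `string[pyarrow]` dtype needs pandas >= 1.3 with pyarrow installed):

        import pandas as pd

        words = ["foo", "bar", "baz"] * 1_000_000

        # object dtype stores a full Python string object per row
        obj_col = pd.Series(words, dtype="object")
        # the Arrow-backed string dtype keeps the data in compact Arrow buffers
        str_col = pd.Series(words, dtype="string[pyarrow]")

        print(obj_col.memory_usage(deep=True))
        print(str_col.memory_usage(deep=True))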

    Use Parquet and leverage column pruning. `usecols` (which is for `read_csv`) doesn't give you column pruning. You need a columnar file format, and you need to pass the `columns` argument to `read_parquet`. You can never truly "skip" a column in a row-based format like CSV; the whole file still has to be read and parsed. Spark's optimizer does column projections automagically; with Pandas you have to do them manually.
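
    Roughly (made-up file and column names):

        import pandas as pd

        # usecols limits the columns in the resulting DataFrame,
        # but the CSV parser still has to scan every byte of the file
        df_csv = pd.read_csv("trips.csv", usecols=["fare", "tip"])

        # with Parquet, only the requested columns are read from disk
        df_pq = pd.read_parquet("trips.parquet", columns=["fare", "tip"])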

    Use predicate pushdown filtering to limit the data that's read into the DataFrame; here's a blog post I wrote on this: https://coiled.io/blog/parquet-column-pruning-predicate-push...
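
    With the pyarrow engine you can pass `filters` to `read_parquet`, so row groups that can't match the predicate are skipped instead of being read (again, made-up names):

        import pandas as pd

        df = pd.read_parquet(
            "trips.parquet",
            columns=["fare", "tip"],
            # pushed down to the Parquet reader, so non-matching
            # row groups are never loaded into memory
            filters=[("fare", ">", 100)],
        )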

    Use a technology like Dask (each partition in a Dask DataFrame is a Pandas DataFrame) that doesn't require everything to be stored in memory and can run computations in a streaming manner.
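
    The Dask API mirrors Pandas pretty closely; a tiny sketch (hypothetical paths and columns):

        import dask.dataframe as dd

        # one Pandas DataFrame per partition, loaded lazily
        ddf = dd.read_parquet("trips/*.parquet", columns=["pickup_zone", "fare"])

        # work happens partition by partition; nothing is materialized
        # until .compute() is called
        result = ddf.groupby("pickup_zone")["fare"].mean().compute()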

  • by mint2 on 9/21/2021, 12:22:38 AM

    It’s useful to point out pitfalls of lower precision too. I don’t usually see these articles go over that.

    Operations on a few million rows of float32s can give strange results, for example when summing: `df["colA"].sum()` can be noticeably different from `df.sort_index()["colA"].sum()`. It's a trap for the unwary.
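
    A small sketch of the effect with synthetic data (the exact discrepancy depends on the values, but float32 rounding makes the sum order-dependent):

        import numpy as np
        import pandas as pd

        rng = np.random.default_rng(42)
        # the running sum quickly exceeds float32's ~7 significant digits,
        # so the rounding error depends on the order of the rows
        values = rng.normal(loc=1e6, scale=1.0, size=3_000_000).astype("float32")
        df = pd.DataFrame({"colA": values}, index=rng.permutation(len(values)))

        print(df["colA"].sum())                    # summed in stored order
        print(df.sort_index()["colA"].sum())       # summed after sorting the index
        print(df["colA"].astype("float64").sum())  # float64 for comparison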