Description
Feature Type
- Adding new functionality to pandas
- Changing existing functionality in pandas
- Removing existing functionality in pandas
Problem Description
I want a reliable way to understand why a single pandas operation is slow or uses far more memory than expected. When a merge, groupby, sort, or rolling call suddenly takes minutes or crashes with an out-of-memory error, pandas completes the work or fails without telling me which path it took, how many temporary arrays were created, or where the time went. That leaves me guessing, rewriting code, and trying different dtypes and arguments until something improves.
External profilers can show Python-level hotspots, but they do not answer the pandas-specific questions I need in the moment: whether a join built a large hash table, whether groupby spent most of its time building the grouper, whether a sort triggered large copies, or whether Copy-on-Write forced unexpected materialization. With that information I could make simple changes such as adjusting key dtypes, choosing observed or sort settings, reducing columns earlier, or validating join cardinality. Without it, debugging performance and memory spikes is slow, fragile, and hard to reproduce.
Feature Description
A practical way to build this is to introduce an opt-in collector that wraps a single pandas operation and produces a compact report. When diagnostics are turned off, nothing happens apart from one cheap check, so there is no meaningful overhead in normal use.
The public entry point could be pd.diagnostics.report(func, *args, **kwargs), with an optional context manager for the same behavior. Internally, the wrapper sets a thread-local reference to the active collector, starts wall-time measurement and optional memory tracking, executes the operation, then stops tracking and clears the thread-local. After the call completes, the collector freezes the captured metrics into a DiagnosticsReport object, which is attached to the returned result via a private attribute. A public DataFrame.explain() method can then format that report as text or JSON.
On the pandas internals side, the only required plumbing is a small helper such as pd.diagnostics._collector() that returns None when diagnostics are disabled. Hot paths can call this helper and, when it is present, record a few coarse phases using begin and end markers, plus counts for allocations and copies at the places where pandas already allocates new arrays or forces a copy. Instrumentation can start with merge, groupby, sort, and rolling. Memory reporting can rely on tracemalloc as a best-effort estimate and remain optional because it adds overhead.
Alternative Solutions
Today the closest options are general-purpose profilers and memory tools. We can time operations with timeit or perf_counter, profile CPU with cProfile or py-spy, and track allocations with tracemalloc or external packages like memory-profiler or Scalene. These tools are useful, but they are not pandas-aware, so they do not explain which internal path was taken, how many intermediate arrays were created, or what simple argument and dtype changes would help.
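For reference, the best available today is hand-rolling perf_counter plus tracemalloc around a single call. This yields totals only, with no phases, no copy counts, and no hint about which internal path ran:

```python
import time
import tracemalloc

import pandas as pd

df = pd.DataFrame({"key": [1, 2, 2, 3] * 1000, "val": range(4000)})

tracemalloc.start()
t0 = time.perf_counter()
out = df.groupby("key")["val"].sum()  # the operation under investigation
elapsed = time.perf_counter() - t0
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

# Reports a wall time and a peak allocation figure, nothing pandas-specific.
print(f"wall time: {elapsed:.4f}s, peak traced memory: {peak} bytes")
```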
Pandas itself offers pieces like DataFrame.info() and memory_usage() plus performance guidance in the docs, and merge validation can catch some join blowups early, but none of these provide an on-demand, structured report for a single operation.
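For example, merge's existing validate argument rejects an unexpected join cardinality up front, and memory_usage gives a static size estimate, but neither explains where time or memory went after a slow call:

```python
import pandas as pd

left = pd.DataFrame({"k": [1, 2, 3], "a": ["x", "y", "z"]})
right = pd.DataFrame({"k": [1, 1, 2], "b": [10, 20, 30]})

# validate= raises MergeError before the join runs if cardinality is wrong.
try:
    left.merge(right, on="k", validate="one_to_one")
    blew_up = False
except pd.errors.MergeError:
    blew_up = True  # right has duplicate keys, so one_to_one fails

# Declaring the true cardinality passes and documents intent.
merged = left.merge(right, on="k", validate="one_to_many")

# Static memory estimate of the result; says nothing about temporaries
# created while the merge was running.
size_bytes = int(merged.memory_usage(deep=True).sum())
```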
Additional Context
The idea is inspired by existing frameworks: Apache Spark has a built-in DataFrame.explain(), Polars has LazyFrame.explain(), DuckDB has EXPLAIN, and so on.
Please let me know if this idea sounds good; I can dedicate some time to it and start working on a PR.