Why Is This Pandas Rolling Function Exhibiting Nonlinear Slowdown When Run On The Cloud?

I moved a stock market analysis library to the cloud and ran into strange performance behavior in a key pandas function. On my low-end Dell XPS laptop it runs reasonably quickly and its time-to-finish scales linearly, as expected.

When I ran the same code on CodeAnywhere and Google Colab, performance tanked.

The rolling call in the example computes the 20th percentile of the intraday "high" prices over a window of the previous 20 business days, using only rows from strictly earlier dates (hence closed='left').

Here is the code I'm using to test it:

from data_access import backfill_historical_intraday_data, form_full_data
from time import time
import pandas as pd
from pandas.api.indexers import VariableOffsetWindowIndexer

my_ticker = 'DBL'

# Full intraday price history for the ticker, including a 'date' column.
my_intraday_data = form_full_data(handle=my_ticker)

# Restrict to recent data and index by date so the offset indexer can use it.
my_data = my_intraday_data[my_intraday_data['date'] > '2020-01-01']
my_data.set_index('date', inplace=True)

for k in range(8):
    my_length = 10 * (3 ** k)
    start = time()
    # Time the rolling 20th percentile of 'high' over the previous 20
    # business days, computed on the last my_length rows only.
    my_trial_data = my_data.copy().iloc[-1 * my_length:]
    offset = pd.offsets.BDay(20)
    indexer = VariableOffsetWindowIndexer(index=my_trial_data.index, offset=offset)
    my_low_highs = my_trial_data['high'].rolling(indexer, closed='left').quantile(0.2)
    end = time()
    print(f'Process with df length of {my_length} took {end - start} seconds.')

The whole dataframe uses about 4.4 MB of memory.
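That figure comes from a quick check along these lines (any of the usual pandas memory reports gives the same ballpark):

# Rough memory footprint; deep=True also counts object/string columns.
print(my_data.memory_usage(deep=True).sum() / 1024 ** 2, 'MB')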

Here are the results using my cheap XPS:

Process with df length of 10 took 0.004008054733276367 seconds.  
Process with df length of 30 took 0.003987312316894531 seconds.  
Process with df length of 90 took 0.006006479263305664 seconds.  
Process with df length of 270 took 0.00899958610534668 seconds.  
Process with df length of 810 took 0.02326822280883789 seconds.  
Process with df length of 2430 took 0.07396459579467773 seconds.  
Process with df length of 7290 took 0.23105525970458984 seconds.  
Process with df length of 21870 took 0.7741758823394775 seconds.  

Same code on Colab (immediately after buying 100 compute credits):

Process with df length of 10 took 0.003057718276977539 seconds.  
Process with df length of 30 took 0.0040819644927978516 seconds.  
Process with df length of 90 took 0.004738807678222656 seconds.  
Process with df length of 270 took 0.01237034797668457 seconds.  
Process with df length of 810 took 0.02781844139099121 seconds.  
Process with df length of 2430 took 0.22655129432678223 seconds.  
Process with df length of 7290 took 6.735988140106201 seconds.  
Process with df length of 21870 took 51.67249917984009 seconds.  

Up to a length of 810, the two sets of timings matched very closely. But then a 3x increase in length led to a 10x increase in time on Colab, as though the process became O(N^2) at that point but not before.
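To put a rough number on that, here is an illustrative snippet that estimates the empirical scaling exponent between consecutive Colab runs (the timing pairs are copied, rounded, from the output above):

import math

# (length, seconds) pairs copied from the Colab output above, rounded.
colab_timings = [
    (810, 0.0278),
    (2430, 0.2266),
    (7290, 6.7360),
    (21870, 51.6725),
]

# If time ~ N**p, then p = log(t2/t1) / log(n2/n1) between consecutive runs.
for (n1, t1), (n2, t2) in zip(colab_timings, colab_timings[1:]):
    p = math.log(t2 / t1) / math.log(n2 / n1)
    print(f'{n1} -> {n2}: exponent ~ {p:.2f}')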

There is no RAM issue. This is a small df and the Colab resource monitor never goes above 10% utilization.

Any ideas why two different cloud computing platforms exhibit this nonlinear performance on such a tiny dataset?
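For anyone who wants to try reproducing this without my data layer, here is a self-contained sketch on synthetic minute-bar data that exercises the same rolling call. The synthetic index and the 'high' column are just stand-ins for what form_full_data returns:

from time import time

import numpy as np
import pandas as pd
from pandas.api.indexers import VariableOffsetWindowIndexer

# Synthetic one-minute bars, trading hours and weekdays only --
# a stand-in for the intraday data returned by form_full_data().
idx = pd.date_range('2020-01-02 09:30', '2021-12-31 16:00', freq='min')
idx = idx[idx.indexer_between_time('09:30', '16:00')]
idx = idx[idx.dayofweek < 5]

synthetic = pd.DataFrame({'high': 100 + np.random.randn(len(idx)).cumsum()}, index=idx)

for k in range(8):
    length = 10 * (3 ** k)
    trial = synthetic.iloc[-length:]
    start = time()
    indexer = VariableOffsetWindowIndexer(index=trial.index, offset=pd.offsets.BDay(20))
    low_highs = trial['high'].rolling(indexer, closed='left').quantile(0.2)
    print(f'Length {length}: {time() - start:.4f} seconds')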