I find it very telling that both memory hotspots were explicit loops, and that the fixes were to use implicit looping constructs (range and a list comprehension). I wonder whether the underlying problem was the loop itself or the constant append() calls.
That's an interesting point - there are competing factors here. The list comprehension is usually a bit more efficient because it doesn't have to look up the list's append method and make a function call on every iteration, I think. Take a look at this (albeit a bit old) article:
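To make that concrete, here's a quick microbenchmark sketch comparing the two styles. The filter x % 3 and the sizes are just stand-ins, not anything from the original code:

```python
import timeit

def with_loop(n):
    result = []
    for x in range(n):
        if x % 3:            # keep values not divisible by 3 (arbitrary filter)
            result.append(x) # attribute lookup + call on every kept element
    return result

def with_comprehension(n):
    # Same logic; the comprehension's append is a specialized bytecode op,
    # so there's no per-iteration method lookup.
    return [x for x in range(n) if x % 3]

assert with_loop(1_000) == with_comprehension(1_000)

loop_t = timeit.timeit(lambda: with_loop(10_000), number=200)
comp_t = timeit.timeit(lambda: with_comprehension(10_000), number=200)
print(f"loop: {loop_t:.3f}s  comprehension: {comp_t:.3f}s")
```

The exact gap depends on the interpreter version, and under PyPy the JIT can erase most of the difference anyway.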
I imagine that lookup overhead would hit on every iteration, while the list add only happens when the conditional evaluates to true - so the append seems a less likely culprit. Unless, that is, the cost of growing the list (which can eventually bottom out in a syscall to the allocator) is high enough to outweigh its infrequency.
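One mitigating detail: in CPython, lists over-allocate, so most append calls don't touch the allocator at all. A quick sketch that makes the growth pattern visible (CPython-specific; PyPy reports sizes differently):

```python
import sys

lst = []
sizes = []
for _ in range(100):
    lst.append(None)
    sizes.append(sys.getsizeof(lst))

# The reported size only jumps when the backing array is actually
# reallocated; in between, appends reuse the over-allocated slack.
reallocations = sum(1 for a, b in zip(sizes, sizes[1:]) if b != a)
print(f"100 appends triggered only ~{reallocations + 1} reallocations")
```

So the amortized cost per append is small, even if an individual reallocation is occasionally expensive.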
Generators would help with this too: instead of materializing the filtered list and prepending via concatenation (i.e. [2] + [x for x in s if x]), you could do something like itertools.chain([2], (x for x in s if x)). Then you avoid ever allocating a massive list -- provided that the rest of your code avoids that too.
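A minimal sketch of the two versions side by side, with s as a hypothetical input sequence containing some falsy entries:

```python
import itertools

s = [0, 3, 0, 5, 7, 0, 11]  # hypothetical input; zeros get filtered out

# Eager: builds the filtered list, then a second list for the concatenation.
eager = [2] + [x for x in s if x]

# Lazy: nothing is materialized; values are produced one at a time
# as the consumer iterates over the chain.
lazy = itertools.chain([2], (x for x in s if x))

assert list(lazy) == eager
print(eager)  # → [2, 3, 5, 7, 11]
```

The caveat in the comment above is real, though: if the consumer ends up calling list() or sorted() on the chain, the full allocation just moves downstream.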
Well, the fact that they're running it in PyPy makes all of this that much more challenging to debug - PyPy's tracing JIT compiles hot loops to machine code at runtime, so you'd have to inspect the traces it actually generates to identify the real costs.