Running the same benchmark several times seems to produce quite different results. Here is an example of what happens if I run elementary.yaml
with 1000 iterations (100 iterations seems to be worse):

I am reporting this here, but it may be related to Mathics-core, where something similar happens when we repeatedly call
mathics -e 'Timing[Do[<test expression>,{1000}]][[1]]'
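
To put a number on the run-to-run variation, something like the following sketch might help. It is not part of the benchmark suite: the helper names `time_command` and `summarize` are made up for illustration, and it measures wall-clock time of the whole process (including interpreter startup), which is not the same quantity that `Timing[...]` reports. The command in the `__main__` block is a stand-in; the `mathics -e ...` call above would be substituted there.

```python
import statistics
import subprocess
import sys
import time


def time_command(command, repeats=5):
    """Run `command` `repeats` times and return the wall-clock seconds of each run.

    Hypothetical helper: this includes process startup time, so it is only a
    rough proxy for what Timing[...] measures inside the interpreter.
    """
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.run(command, check=True, capture_output=True)
        timings.append(time.perf_counter() - start)
    return timings


def summarize(timings):
    """Return min, max, mean, sample standard deviation and relative spread."""
    mean = statistics.mean(timings)
    # The sample standard deviation needs at least two data points.
    stdev = statistics.stdev(timings) if len(timings) > 1 else 0.0
    return {
        "min": min(timings),
        "max": max(timings),
        "mean": mean,
        "stdev": stdev,
        # stdev / mean makes runs with different iteration counts comparable.
        "rel_spread": stdev / mean if mean else float("nan"),
    }


if __name__ == "__main__":
    # Stand-in command (an assumption for the sketch); replace it with the
    # mathics invocation from the report to measure the real thing.
    command = [sys.executable, "-c", "pass"]
    print(summarize(time_command(command, repeats=5)))
```

Comparing `rel_spread` for the 100-iteration and 1000-iteration runs would show whether the shorter run really is noisier, or whether the differences stay within what a few repeated runs would produce anyway.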