Greetings,
I ported part of my AFL backtest code to a plugin that takes advantage
of the graphics math cores on the video card, which are normally used
for 3D graphics. I was able to get a several-thousand-fold performance
improvement over AFL code alone.

My goal was to reduce the 25 seconds that AFL code alone needs for a
single portfolio-level backtest to less than 1 second, allowing
multi-day optimization and walk-forward runs to complete in a more
reasonable time, and also just to see how fast I could get it to run.

The backtest runs over 1 year of 5-minute bars for about 1000 symbols.
1 year of data normally takes 25 seconds for AmiBroker alone, or 18
seconds for 6 months of data. A typical optimization run takes hundreds
of these passes per walk-forward step, taking hours.
Using the Nvidia CUDA API, running on my mid-range video card, it was
much faster. Much, much, much faster. How fast?

It reduced the run time from 25 s to... 4.4 ms. That is more than 200
runs per second!

I didn't believe the timing when I first saw it. So I put 1,000 runs in
a loop and, sure enough, it ran 1,000 iterations in about 4 1/2
seconds. This far exceeded my goal and my expectations. The resulting
trade list matches that obtained by the AFL version of this code.

I estimate that it is processing 32 GB of bar data per second.
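The figures quoted above can be sanity-checked with a few lines of arithmetic. This is only a sketch: the function names are made up for illustration, and it assumes 1 GB = 10^9 bytes. With the numbers from this post, 25 s / 4.4 ms gives a speedup of roughly 5,700x, 1 / 4.4 ms gives about 227 passes per second, and 32 GB/s over 4.4 ms implies roughly 141 MB of bar data touched per pass.

```c
#include <assert.h>

/* Speedup of one GPU pass over one AFL-only pass. */
double speedup(double afl_seconds, double gpu_seconds)
{
    assert(gpu_seconds > 0.0);
    return afl_seconds / gpu_seconds;
}

/* Backtest passes completed per second at a given time per pass. */
double passes_per_second(double gpu_seconds)
{
    assert(gpu_seconds > 0.0);
    return 1.0 / gpu_seconds;
}

/* Bytes touched in one pass, implied by a throughput estimate. */
double bytes_per_pass(double bytes_per_second, double gpu_seconds)
{
    assert(gpu_seconds > 0.0);
    return bytes_per_second * gpu_seconds;
}
```

The same arithmetic explains the loop test: 1,000 passes at 4.4 ms each is 4.4 s, which agrees with the "about 4 1/2 seconds" measured.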
Getting this to work at peak performance was tricky. Most of what I
have learned about code optimization does not apply.

It uses AmiBroker to load the symbol data and perform calculations that
do not depend on the optimization parameters. Once the data is loaded
into video memory, repeated passes can be made with different
parameters, avoiding the overhead of loading it again.
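The load-once, run-many structure described above could look something like the following. This is a CPU-only sketch in plain C, not the plugin's actual code: `ResidentData`, `run_pass` and the threshold rule are hypothetical stand-ins. In a CUDA version the buffer would presumably be allocated with `cudaMalloc` and filled once with `cudaMemcpy`, and `run_pass` would be a kernel launch.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical container for data that stays resident between passes.
   In a CUDA plugin, `close` would point to video memory instead. */
typedef struct {
    float *close;    /* parameter-independent input, e.g. closing prices */
    size_t n_bars;
} ResidentData;

/* One-time load: copy the bars supplied by the host application. */
int resident_load(ResidentData *d, const float *bars, size_t n_bars)
{
    assert(d != NULL && bars != NULL && n_bars > 0);
    d->close = malloc(n_bars * sizeof *d->close);
    if (d->close == NULL)
        return -1;
    memcpy(d->close, bars, n_bars * sizeof *d->close);
    d->n_bars = n_bars;
    return 0;
}

/* One pass: count bars that rise more than `threshold` over the
   previous bar. A stand-in for the real trading rules. */
size_t run_pass(const ResidentData *d, float threshold)
{
    size_t signals = 0;
    for (size_t i = 1; i < d->n_bars; ++i)
        if (d->close[i] - d->close[i - 1] > threshold)
            ++signals;
    return signals;
}

/* Many passes over the same resident data; only the parameter changes. */
void sweep(const ResidentData *d, const float *thresholds,
           size_t n_params, size_t *out_signals)
{
    for (size_t p = 0; p < n_params; ++p)
        out_signals[p] = run_pass(d, thresholds[p]);
}

/* Release the resident buffer. */
void resident_free(ResidentData *d)
{
    free(d->close);
    d->close = NULL;
    d->n_bars = 0;
}
```

The point of the split is that `resident_load` is paid once per optimization, while `sweep` only pays for the computation itself on each parameter set.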
For non-backtest/optimization runs, the code just evaluates one symbol
and passes the data back to the AmiBroker buy/sell/short/cover arrays,
making it easy to test, validate and visualize the trades. There is
very little performance gain in this case.

There are problems, however. To run optimizations at peak speed, I
cannot use AmiBroker to calculate the optimization goal function. So I
am in the process of writing code to match signals and calculate the
portfolio fitness function. Once I do this, I will be able to perform
full optimizations and walk-forwards 3 orders of magnitude faster than
is possible with AmiBroker alone.
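To give an idea of what "match signals and calculate a fitness function" involves, here is a much-simplified sketch in plain C. It is not the code described in this post: it handles one symbol, long trades only, with no position sizing, and the fitness measure (net profit divided by maximum drawdown of closed trades) is just an assumed example. A real portfolio-level version would also have to rank and size competing signals across symbols.

```c
#include <assert.h>
#include <stddef.h>

/* Pair each buy signal with the next sell signal (long only) and write
   the profit of each closed trade to `profit`. Returns the trade count. */
size_t match_signals(const float *price, const unsigned char *buy,
                     const unsigned char *sell, size_t n_bars,
                     float *profit, size_t max_trades)
{
    assert(price != NULL && buy != NULL && sell != NULL && profit != NULL);
    size_t n_trades = 0;
    int in_trade = 0;
    float entry = 0.0f;
    for (size_t i = 0; i < n_bars && n_trades < max_trades; ++i) {
        if (!in_trade && buy[i]) {
            in_trade = 1;
            entry = price[i];
        } else if (in_trade && sell[i]) {
            profit[n_trades++] = price[i] - entry;
            in_trade = 0;
        }
    }
    return n_trades;
}

/* Hypothetical fitness: net profit divided by the maximum drawdown of
   the closed-trade equity curve. Falls back to net profit when there
   is no drawdown at all. */
double fitness(const float *profit, size_t n_trades)
{
    double equity = 0.0, peak = 0.0, max_dd = 0.0;
    for (size_t i = 0; i < n_trades; ++i) {
        equity += profit[i];
        if (equity > peak)
            peak = equity;
        if (peak - equity > max_dd)
            max_dd = peak - equity;
    }
    return max_dd > 0.0 ? equity / max_dd : equity;
}
```

Keeping this step next to the signal generation, rather than handing signals back to AmiBroker on every pass, is what would let the whole optimization loop run without leaving the plugin.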
Also, this is not general-purpose code. Changing the system code means
changing a DLL written in C. However, there is no reason that this
could not be made more general.

I have made some prototypes of "Cuda" versions of basic AFL functions.
The idea is to queue the function calls into a definition executed by a
micro kernel running on the graphics cores. The result would be the
ability to use the full power of the graphics cores by modifying AFL
code to use Cuda-aware versions, with no changes to C code. It would be
an interesting, but big, project.
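The queued-call idea can be illustrated with a tiny interpreter. Again, this is a CPU-only sketch in plain C under my own assumptions, not the prototype mentioned above: the opcodes, the register layout and all names are hypothetical. Each "Cuda-aware" function call would append an operation to a program instead of running immediately, and the micro kernel would then execute the whole program for every bar.

```c
#include <assert.h>
#include <stddef.h>

enum { MAX_OPS = 16, N_REGS = 4 };

typedef enum { OP_ADD, OP_SUB, OP_GT } OpCode;    /* hypothetical opcodes */

/* One queued call: regs[dst] = regs[a] <op> regs[b], element by element. */
typedef struct { OpCode code; int dst, a, b; } Op;

typedef struct { Op ops[MAX_OPS]; size_t n_ops; } Program;

/* Queue one call instead of executing it immediately. */
int program_push(Program *p, OpCode code, int dst, int a, int b)
{
    assert(dst >= 0 && dst < N_REGS);
    assert(a >= 0 && a < N_REGS && b >= 0 && b < N_REGS);
    if (p->n_ops >= MAX_OPS)
        return -1;
    p->ops[p->n_ops++] = (Op){ code, dst, a, b };
    return 0;
}

/* Interpreter for one bar: the role one micro kernel thread would play. */
static void run_bar(const Program *p, float *regs[N_REGS], size_t bar)
{
    for (size_t i = 0; i < p->n_ops; ++i) {
        const Op *op = &p->ops[i];
        float x = regs[op->a][bar];
        float y = regs[op->b][bar];
        switch (op->code) {
        case OP_ADD: regs[op->dst][bar] = x + y; break;
        case OP_SUB: regs[op->dst][bar] = x - y; break;
        case OP_GT:  regs[op->dst][bar] = (x > y) ? 1.0f : 0.0f; break;
        }
    }
}

/* CPU stand-in for the kernel launch: one loop iteration per bar. */
void program_run(const Program *p, float *regs[N_REGS], size_t n_bars)
{
    for (size_t bar = 0; bar < n_bars; ++bar)
        run_bar(p, regs, bar);
}
```

Only element-wise operations are shown, so every bar is independent and maps naturally onto one thread per bar. Functions with lookback, such as moving averages, would need more care because each output depends on earlier bars.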