RE: [amibroker] Freakishly fast backtest using 64 cores, AmiBroker Email List Archive

Greetings,

I ported part of my AFL backtest code to a plugin that takes
advantage of the graphics math cores on the video card, which are
normally used for 3D graphics.

I was able to get a several-thousand-fold performance improvement
over AFL code alone.

My goal was to cut the 25 seconds that AFL code alone needs for a
single portfolio-level backtest to less than 1 second, so that
multi-day optimization and walk-forward runs complete in a more
reasonable time. I also just wanted to see how fast I could get it
to run.

The backtest runs over 1 year of 5-minute bars for about 1000
symbols. AmiBroker alone normally takes 25 seconds for 1 year of
data, or 18 seconds for 6 months of data. A typical optimization
run needs hundreds of these passes per walk-forward step, which
takes hours.

Using the Nvidia CUDA API on my mid-range video card, it was much
faster. Much, much, much faster. How fast?

It reduced the run time from 25s to... 4.4ms. That is more than
200 backtests per second!

I didn't believe the timing when I first saw it. So I put 1,000
runs in a loop and, sure enough, it ran 1,000 iterations in about
4 1/2 seconds. This far exceeded both my goal and my expectations.

The resulting trade list matches that obtained by the AFL version of
this code.

I estimate that it is processing 32 GB of bar data per second.
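
The figures quoted above are consistent with one another, which a few
lines of C can confirm. This is only a sanity check on the numbers in
this post, not part of the plugin. The function names are made up for
illustration, and 1 GB is assumed to mean 10^9 bytes.

```c
#include <stdio.h>

/* Figures quoted in the post. */
#define AFL_SECONDS_PER_PASS 25.0    /* one portfolio backtest in AFL alone */
#define GPU_SECONDS_PER_PASS 0.0044  /* the same backtest on the video card */
#define EST_BYTES_PER_SECOND 32e9    /* estimated throughput (assumes 1 GB = 1e9 bytes) */

/* How many times faster one GPU pass is than one AFL pass. */
double speedup(void) { return AFL_SECONDS_PER_PASS / GPU_SECONDS_PER_PASS; }

/* Backtest passes completed per second on the GPU. */
double passes_per_second(void) { return 1.0 / GPU_SECONDS_PER_PASS; }

/* Wall-clock seconds for n back-to-back GPU passes. */
double seconds_for_passes(int n) { return n * GPU_SECONDS_PER_PASS; }

/* Bytes of bar data touched per pass, implied by the throughput estimate. */
double implied_bytes_per_pass(void) {
    return EST_BYTES_PER_SECOND * GPU_SECONDS_PER_PASS;
}

/* Print all four derived figures. */
void print_summary(void) {
    printf("speedup:         %.0fx\n", speedup());
    printf("passes per sec:  %.1f\n", passes_per_second());
    printf("1000 passes:     %.1f s\n", seconds_for_passes(1000));
    printf("data per pass:   %.1f MB\n", implied_bytes_per_pass() / 1e6);
}
```

With these inputs the speedup works out to roughly 5,700x, about 227
passes per second, 4.4 seconds for 1,000 passes, and about 141 MB of
bar data touched per pass.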

Getting this to work at peak performance was tricky. Most of what I
have learned about code optimization does not apply.

It uses AmiBroker to load the symbol data and perform calculations
that do not depend on the optimization parameters. Once the data is
loaded into video memory, repeated passes can be made with different
parameters without reloading it or repeating those calculations.
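
The "prepare once, run many passes" pattern described above can be
sketched in plain C. This is a CPU-side stand-in under stated
assumptions, not the plugin's code: the Prepared struct, the
bar-to-bar change calculation, and the threshold parameter are all
hypothetical, chosen only to show which work sits outside the
optimization loop.

```c
#include <stddef.h>

/* Hypothetical prepared data: inputs plus values that depend on no
   optimization parameter, so they are computed only once. */
typedef struct {
    const float *close;  /* closing prices, n bars */
    float *change;       /* bar-to-bar change, filled by prepare() */
    size_t n;            /* number of bars */
} Prepared;

/* One-time setup: compute everything that is parameter-independent. */
void prepare(Prepared *p) {
    if (p->n == 0) return;
    p->change[0] = 0.0f;  /* no previous bar for the first entry */
    for (size_t i = 1; i < p->n; ++i)
        p->change[i] = p->close[i] - p->close[i - 1];
}

/* One pass: count bars whose change exceeds the threshold being tested.
   Only parameter-dependent work happens here. */
size_t run_pass(const Prepared *p, float threshold) {
    size_t signals = 0;
    for (size_t i = 0; i < p->n; ++i)
        if (p->change[i] > threshold) ++signals;
    return signals;
}

/* Optimization sweep: reuse the prepared data for every parameter value. */
void sweep(const Prepared *p, const float *thresholds, size_t count,
           size_t *out) {
    for (size_t k = 0; k < count; ++k)
        out[k] = run_pass(p, thresholds[k]);
}
```

A caller would fill in close, call prepare() once, and then call
sweep() with as many parameter values as the optimization needs. On a
GPU the prepared arrays would stay resident in video memory between
passes instead of in host memory.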

For runs other than backtests and optimizations, the code just
evaluates one symbol and passes the data back to the AmiBroker
buy/sell/short/cover arrays, making it easy to test, validate and
visualize the trades. There is very little performance gain in this
case.

There are problems, however. To run optimizations at peak speed, I
cannot use AmiBroker to calculate the optimization goal function.
So, I am in the process of writing code to match signals and
calculate the portfolio fitness function. Once I do this, I will be
able to perform full optimizations and walk-forward runs about 3
orders of magnitude faster than is possible with AmiBroker alone.
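
To make "match signals and calculate a fitness function" concrete,
here is a deliberately simplified C sketch. It is not the code
described above: it assumes a single symbol, long trades only, one
share per trade, and net profit as the fitness value. A real
portfolio-level version would also need position sizing, ranking
across symbols, and short/cover handling. The Fitness struct and
match_signals name are hypothetical.

```c
#include <stddef.h>

/* Hypothetical result of one pass. */
typedef struct {
    size_t trades;      /* completed round trips */
    double net_profit;  /* sum of (exit - entry), one share per trade */
} Fitness;

/* Pair each buy signal with the next sell signal and total the profit.
   Buy signals while already long and sell signals while flat are
   ignored. A position still open at the last bar is not counted. */
Fitness match_signals(const double *price, const unsigned char *buy,
                      const unsigned char *sell, size_t n) {
    Fitness f = {0, 0.0};
    int in_position = 0;
    double entry = 0.0;

    for (size_t i = 0; i < n; ++i) {
        if (!in_position && buy[i]) {
            in_position = 1;      /* open a long trade at this bar's price */
            entry = price[i];
        } else if (in_position && sell[i]) {
            in_position = 0;      /* close it and book the result */
            f.net_profit += price[i] - entry;
            ++f.trades;
        }
    }
    return f;
}
```

An optimizer would call something like this once per parameter set
and keep the parameters with the best fitness value.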

Also, this is not general-purpose code. Changing the system code
means changing a DLL written in C. However, there is no reason that
this could not be made more general.

I have made some prototypes of "CUDA" versions of basic AFL
functions. The idea is to queue the function calls into a definition
executed by a micro kernel running on the graphics cores. The result
would be the ability to use the full power of the graphics cores by
modifying AFL code to use CUDA-aware versions, with no changes to C
code. It would be an interesting but big project.
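
The queued-calls idea can be illustrated with a tiny CPU-side
interpreter in C. This is only a sketch of the general technique,
assuming a made-up three-operation instruction set and a fixed set of
array "registers". None of the names below come from the prototypes
mentioned above, and the moving average here uses a shortened window
during warm-up, which may not match AFL's MA().

```c
#include <stddef.h>

#define MAX_BARS 64  /* illustrative capacity */
#define NUM_REGS 4   /* illustrative number of array registers */

/* Hypothetical instruction set. */
typedef enum { OP_MA, OP_SUB, OP_GT } OpCode;

/* One queued call. For OP_MA, `b` is the period, not a register. */
typedef struct {
    OpCode code;
    int dst, a, b;
} Op;

typedef struct {
    float reg[NUM_REGS][MAX_BARS];  /* one array per register */
    size_t n;                       /* bars in use, at most MAX_BARS */
} Machine;

/* Execute the queued ops in order. The inner loop over bars is the part
   a GPU would spread across threads. For OP_MA, dst must differ from a. */
void run_queue(Machine *m, const Op *ops, size_t count) {
    for (size_t k = 0; k < count; ++k) {
        const Op *op = &ops[k];
        for (size_t i = 0; i < m->n; ++i) {
            switch (op->code) {
            case OP_MA: {  /* average of up to `period` bars ending at i */
                size_t period = (size_t)op->b;
                size_t start = (i + 1 >= period) ? i + 1 - period : 0;
                float sum = 0.0f;
                for (size_t j = start; j <= i; ++j)
                    sum += m->reg[op->a][j];
                m->reg[op->dst][i] = sum / (float)(i - start + 1);
                break;
            }
            case OP_SUB:   /* element-wise difference */
                m->reg[op->dst][i] = m->reg[op->a][i] - m->reg[op->b][i];
                break;
            case OP_GT:    /* element-wise comparison, 1.0 or 0.0 */
                m->reg[op->dst][i] =
                    (m->reg[op->a][i] > m->reg[op->b][i]) ? 1.0f : 0.0f;
                break;
            }
        }
    }
}
```

A moving-average crossover, for example, would be queued as three
ops: two OP_MA calls with different periods followed by one OP_GT.
Changing the trading system then means changing the queue, not the
interpreter, which is the appeal of the approach.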