RE: [amibroker] Re: Freakishly fast backtest using 64 cores, AmiBroker Email List Archive

Very good question. That was a head scratcher.

So the thing is, AmiBroker does a lot more work in an optimization
pass than execute AFL code. In fact, the AFL code may take very
little of the total run time.

As an example, using a database with a good amount of data, write an
AFL file that does nothing but set the buy/sell/short/cover arrays to
0. The backtest will still take a good bit of time.

So, even reducing the AFL run time to zero is not enough. It will
not help much at all.

So, to avoid this, I pass a "mode" variable to my Dll. This mode is
set by a simple optimization statement:

mode = optimize("mode",0,1,3,1);

When mode = 0, the dll will evaluate one symbol like a normal dll.
So if I click on a bar, it will update my printf statements, etc.
buy/sell/short/cover arrays are set. A single backtest (not
optimize) will use the normal AmiBroker trade match and evaluate code
and generate stats as normal.

When mode = 1, this means load the data. The Dll will copy the price
data to a staging area in memory. buy/sell/short/cover are set to 0
to generate no trades. Having AmiBroker align the symbol bars was a
big help here.

When mode = 2, on the first symbol and the first symbol only, it
loads the price data to the video card and executes as many backtest
passes as it needs at a few ms per pass. Once the best combination
is found, it returns. buy/sell/short/cover are set to 0. Note that I
cannot use the AmiBroker signal match and fitness function code. I
have to provide my own. This is where the performance advantage of
all of the extra cores comes into play. It may run hundreds or
thousands of parameter combinations very quickly. I can't use the
built-in optimize support, but brute force is enough for now. After
all, I get 200 combinations per second.
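
A minimal sketch of the brute-force search described for mode 2,
written in Python rather than the C/CUDA the plugin uses. The names
brute_force_search, count_combinations and toy_fitness, and the
"fast"/"slow" parameters, are hypothetical; toy_fitness only stands in
for a real backtest pass plus fitness calculation, which the post does
not show.

```python
from itertools import product


def brute_force_search(param_grid, score_fn):
    """Score every combination in param_grid; return (best_params, best_score)."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:  # ties keep the earliest combination
            best_params, best_score = params, score
    return best_params, best_score


def count_combinations(param_grid):
    """Number of passes an exhaustive search over param_grid needs."""
    total = 1
    for values in param_grid.values():
        total *= len(values)
    return total


def toy_fitness(params):
    # Hypothetical stand-in for one backtest pass plus fitness calculation.
    return -((params["fast"] - 10) ** 2) - ((params["slow"] - 40) ** 2)


grid = {"fast": range(5, 21, 5), "slow": range(20, 61, 10)}
best, score = brute_force_search(grid, toy_fitness)
passes = count_combinations(grid)
seconds_at_200_per_sec = passes / 200  # throughput figure quoted in the post
```

The grid size is what matters at 200 combinations per second: this
4 x 5 grid needs 20 passes, while a grid of 10,000 combinations would
need about 50 seconds.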

When mode = 3, each symbol evaluates using the best parameters found
on the last mode=2 run. buy/sell/short/cover are set. In a
walkforward test, this will always have the best score and be used
for the walkforward step. A custom backtest function adds the chosen
parameters to the backtest report. Mode 3 works like mode 0 except
it uses the optimal parameters rather than default values.
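
The four modes above can be sketched as a small per-symbol dispatcher.
This is an illustration in Python under stated assumptions, not the
plugin's actual code: PluginState, on_symbol, make_signals, the
"threshold" parameter and the toy buy rule are all hypothetical, and a
plain dict stands in for the staging area and video memory.

```python
from dataclasses import dataclass, field

DEFAULT_PARAMS = {"threshold": 0.0}  # hypothetical default parameter set


@dataclass
class PluginState:
    staged: dict = field(default_factory=dict)  # symbol -> prices (staging area)
    best_params: dict = None                    # result of the last mode 2 search
    searched: bool = False                      # True once mode 2 has run


def make_signals(prices, params):
    """Toy rule: buy when the bar-to-bar rise exceeds the threshold."""
    buy = [0] * len(prices)
    for i in range(1, len(prices)):
        if prices[i] > prices[i - 1] + params["threshold"]:
            buy[i] = 1
    return {"buy": buy}


def no_trades(prices):
    return {"buy": [0] * len(prices)}


def on_symbol(state, mode, symbol, prices, search_fn):
    """Called once per symbol per pass, as AmiBroker calls the AFL."""
    if mode == 0:                      # normal single-symbol evaluation
        return make_signals(prices, DEFAULT_PARAMS)
    if mode == 1:                      # stage data, generate no trades
        state.staged[symbol] = list(prices)
        state.searched = False
        return no_trades(prices)
    if mode == 2:                      # search once, on the first symbol only
        if not state.searched:
            state.best_params = search_fn(state.staged)
            state.searched = True
        return no_trades(prices)
    if mode == 3:                      # evaluate with the best parameters
        return make_signals(prices, state.best_params or DEFAULT_PARAMS)
    raise ValueError(f"unknown mode {mode}")


search_calls = []


def fake_search(staged):
    # Hypothetical stand-in for the GPU search; records what it was given.
    search_calls.append(sorted(staged))
    return {"threshold": 1.0}


data = {"AAA": [1.0, 2.0, 4.0], "BBB": [5.0, 5.5, 8.0]}
state = PluginState()
outputs = {}
for mode in (1, 2, 3):                 # the three optimization steps
    for sym, prices in data.items():
        outputs[(mode, sym)] = on_symbol(state, mode, sym, prices, fake_search)
```

Because the mode 1 pass visits every symbol before mode 2 starts, the
search sees the whole staged data set even though it is triggered from
the first symbol's call.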

The Status("action") and Status("actionex") codes could also be used,
but they did not tell me quite what I needed to know. Also, I could
have avoided the mode=2 step if I could find a way to know I was on
the last symbol and run the optimization then. I guess I could pass
the name of the last symbol.

So I use AmiBroker to load and keep the database, visualize the
trades, validate, walkforward and provide deep metrics of the
backtest.

If I wanted to take this further, I would move the trade system logic
out of the Dll and make it programmable from AFL. That way it could
be used by anyone without needing to program in C. I would do this by
passing handles to CUDA arrays through the AFL code.
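
One way the handle idea might look, sketched in Python. This is only a
guess at the design, not something the post specifies: DeviceArrays
and moving_average are hypothetical names, and Python lists stand in
for arrays that would really live in video memory. The point is that
the script side only ever holds small integer handles, while the data
stays on the other side.

```python
class DeviceArrays:
    """Maps integer handles to arrays; lists stand in for GPU memory."""

    def __init__(self):
        self._arrays = {}
        self._next_handle = 1

    def put(self, values):
        """Store a copy of values and return its handle."""
        handle = self._next_handle
        self._next_handle += 1
        self._arrays[handle] = list(values)
        return handle

    def get(self, handle):
        """Fetch the array behind a handle (a download in a real plugin)."""
        return self._arrays[handle]

    def release(self, handle):
        del self._arrays[handle]


def moving_average(store, handle, period):
    """Hypothetical handle-based MA: takes a handle, returns a new handle."""
    data = store.get(handle)
    out = []
    for i in range(len(data)):
        window = data[max(0, i - period + 1): i + 1]  # shorter at the start
        out.append(sum(window) / len(window))
    return store.put(out)


store = DeviceArrays()
close = store.put([1.0, 2.0, 3.0, 4.0])   # upload once
ma2 = moving_average(store, close, 2)     # only handles cross the boundary
result = store.get(ma2)                   # download only when needed
```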

--- In amibroker@xxxxxxxxxps.com, "Paul Ho" <paul.tsho@x..> wrote:
>
> thanks for your insight.
> I hope you don't mind sharing a little bit more detail
> You said "
> > To get the best performance, my AFL code makes one pass over the
> > data, calling a Dll. The Dll takes all of the data needed by the
> > calculation and loads a copy to the video card. This upload is
> > slow, the entire upload takes about 45 seconds for all 1000
> > symbols.
> >
> > Once all of the data is uploaded, the Dll loads a "kernel" into
> > the graphics cores that perform the actual computation and
> > generates the trade list.
>
> Normally AB loads the data from the database as needed, calls a
> function in a dll, and passes data in arrays or whatever as
> arguments of the function. The function will be called for every
> ticker in the watchlist, and data pertaining to that symbol is
> passed each time. I wonder how you do a "single pass" over the data,
> because AB passes the data as part of the argument regardless of how
> many optimizations it has previously run with the same data. I just
> wonder how you do it.
> Cheers
> Paul.
>
> --- In amibroker@xxxxxxxxxps.com, "dloyer123" <dloyer123@> wrote:
> >
> > This uses the mid range video card that happened to come with my
> > system, a 9800GT. The newer 260 and 280 cards are 3 to 4 times
> > faster. The 260 can be found at Best Buy for $300. Some laptops
> > have compatible cards as well.
> >
> > The video card has its own memory, mine has 512MB, some have as
> > much as 1GB. This memory is very fast, once it is loaded from the
> > main system. Nvidia has a professional line of products that have
> > much more memory.
> >
> > To get the best performance, my AFL code makes one pass over the
> > data, calling a Dll. The Dll takes all of the data needed by the
> > calculation and loads a copy to the video card. This upload is
> > slow, the entire upload takes about 45 seconds for all 1000
> > symbols.
> >
> > Once all of the data is uploaded, the Dll loads a "kernel" into
> > the graphics cores that perform the actual computation and
> > generates the trade list. This part is very fast and performs all
> > of the same functions that my AFL version does. The resulting
> > trade list is the same.
> >
> > Because the data is loaded into video memory, it can be reused for
> > many passes over the data with different optimization values. So,
> > hundreds of combinations of optimization values can be tried per
> > second.
> >
> > For non optimization runs, the Dll just loads one symbol into
> > video memory and processes it. Counting the overhead of moving
> > data to the video card and extracting the trade list for a single
> > symbol, the result is similar to AFL code alone. This lets me test
> > the code and make sure it is correct.
> >
> > This approach works best when the data only needs to be loaded
> > once, then "reused" many times. It also works best when there is a
> > lot of data to work with.
> >
> > What is more interesting to me and what would be more useful for
> > others would be a general driver that requires no Dll changes to
> > modify the system. The performance would not be as good as hand
> > optimized code, but would still be much better than AFL code
> > alone. It would take trading system design to a whole new level.
> > It would provide enough performance to make working with intraday
> > data as easy as daily data is today.
> >
> > Writing such a driver would be hard, but I have already done some
> > prototypes and design work. I am tempted to do it for my own use.
> > If I made it available to others, supporting it would be a PITA.
> >
> >
> >
> >
> > --- In amibroker@xxxxxxxxxps.com, "Paul Ho" <paul.tsho@> wrote:
> > >
> > > I'm very interested
> > > could you elaborate a bit more
> > > What model of Nvidia chipset are you using, and with how much
> > > memory?
> > > Not sure exactly what you mean when you say
> > > It uses AmiBroker to load the symbol data and perform
> > > calculations that do not depend on the optimization parameters.
> > > Once loaded into video memory, repeated passes can be made with
> > > different parameters, avoiding any overhead.
> > > Can you give me some examples. I presume when your dll is
> > > called, AB passes one or more arrays of data belonging to 1
> > > symbol, is that true?
> > > Not sure exactly what the rest mean either. How many functions
> > > are you running in your dll, and what does each of them do?
> > > Great of you to share your insight.
> > > Cheers
> > > Paul.
> > >
> > >
> > >
> > > _____
> > >
> > > From: amibroker@xxxxxxxxxps.com [mailto:amibroker@xxxxxxxxxps.com]
> > > On Behalf Of dloyer123
> > > Sent: Tuesday, 5 August 2008 9:19 AM
> > > To: amibroker@xxxxxxxxxps.com
> > > Subject: [amibroker] Freakishly fast backtest using 64 cores
> > >
> > >
> > >
> > > Greetings,
> > >
> > > I ported part of my AFL backtest code to a plugin that takes
> > > advantage of the graphics math cores on the video card that are
> > > normally used for 3d graphics.
> > >
> > > I was able to get a several thousand fold performance
> > > improvement over AFL code alone.
> > >
> > > My goal was to reduce the 25 seconds AFL code alone uses for a
> > > single portfolio level back test to less than 1 second, allowing
> > > multi day optimization and walkforward runs to complete in a
> > > more reasonable time, and also just to see how fast I could get
> > > it to run.
> > >
> > > The backtest runs over 1 year of 5 minute bars for about 1000
> > > symbols. 1 year of data normally takes 25 seconds for AmiBroker
> > > alone, or 18 seconds for 6 months of data. A typical
> > > optimization run takes hundreds of these passes per walk forward
> > > step, taking hours.
> > >
> > > Using the Nvidia CUDA API, running on my mid range video card,
> > > it was much faster. Much, much, much faster. How fast?
> > >
> > > It reduced the run time from 25s to... 4.4ms. That is more than
> > > 200/s!
> > >
> > > I didn't believe the timing when I saw it at first. So, I put
> > > 1,000 runs in a loop and sure enough, it ran 1,000 iterations in
> > > about 4 1/2 seconds. This far exceeded my goal or expectations.
> > >
> > > The resulting trade list matches that obtained by the AFL
> > > version of this code.
> > >
> > > I estimate that it is processing 32GB of bar data/sec.
> > >
> > > Getting this to work at peak performance was tricky. Most of
> > > what I have learned about code optimization does not apply.
> > >
> > > It uses AmiBroker to load the symbol data and perform
> > > calculations that do not depend on the optimization parameters.
> > > Once loaded into video memory, repeated passes can be made with
> > > different parameters, avoiding any overhead.
> > >
> > > For non backtest/optimization runs, the code just evaluates one
> > > symbol and passes the data back to AmiBroker
> > > buy/sell/short/cover arrays, making it easy to test, validate
> > > and visualize the trades. There is very little performance gain
> > > in this case.
> > >
> > > There are problems, however. To run optimizations at peak
> > > speed, I cannot use AmiBroker to calculate the optimization goal
> > > function. So, I am in the process of writing code to match
> > > signals and calculate the portfolio fitness function. Once I do
> > > this, I will be able to perform full optimizations and walk
> > > forwards at 3 orders of magnitude faster than is possible with
> > > AmiBroker alone.
> > >
> > > Also, this is not general purpose code. Changing the system
> > > code means changing a dll written in C. However, there is no
> > > reason that this could not be made more general.
> > >
> > > I have made some prototypes of "Cuda" versions of basic AFL
> > > functions. The idea is to queue the function calls into a
> > > definition executed by a micro kernel running on the graphics
> > > cores. The result would be the ability to use the full power of
> > > the graphics cores by modifying AFL code to use Cuda aware
> > > versions with no changes to C code. It would be an interesting,
> > > but big project.
> > >
> >
>