Very good question. That was a head scratcher.

The thing is, AmiBroker does a lot more work in an optimization pass
than execute AFL code. In fact, the AFL code may take very little of
the total run time.

As an example, using a database with a good amount of data, write an
AFL file that does nothing but set the buy/sell/short/cover arrays to
0. The backtest will still take a good bit of time.

So even reducing the AFL run time to zero is not enough. It will not
help much at all.
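
To avoid this, I pass a "mode" variable to my Dll. This mode is set
by a simple optimization statement:

mode = optimize("mode",0,1,3,1);

The default is 0, so a single backtest sees mode = 0, while an
optimization steps through modes 1, 2 and 3.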

When mode = 0, the Dll evaluates one symbol like a normal Dll. So if
I click on a bar, it updates my printf statements, etc. The
buy/sell/short/cover arrays are set. A single backtest (not optimize)
uses the normal AmiBroker trade matching and evaluation code and
generates stats as normal.

When mode = 1, this means load the data. The Dll copies the price
data to a staging area in memory. buy/sell/short/cover are set to 0
to generate no trades. Having AmiBroker align the symbol bars was a
big help here.

When mode = 2, on the first symbol and the first symbol only, it
loads the price data to the video card and executes as many backtest
passes as it needs at a few ms per pass. Once the best combination is
found, it returns. buy/sell/short/cover are set to 0. Note that I
cannot use the AmiBroker signal matching and fitness function code; I
have to provide my own. This is where the performance advantage of
all of the extra cores comes into play. It may run hundreds or
thousands of parameter combinations very quickly. I can't use the
built-in optimize support, but brute force is enough for now. After
all, I get 200 combinations per second.

When mode = 3, each symbol is evaluated using the best parameters
found on the last mode = 2 run. buy/sell/short/cover are set. In a
walkforward test, this pass will always have the best score and be
used for the walkforward step. A custom backtest function adds the
chosen parameters to the backtest report. Mode 3 works like mode 0
except it uses the optimal parameters rather than default values.
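
A rough sketch of how the four modes could be dispatched inside the
Dll, in plain C. Everything in it is an assumption made for
illustration: plugin_call, the staging array, the toy fitness function
and the parameter range 1..6 are hypothetical names and values, and a
simple CPU loop stands in for the CUDA upload and kernel. It is not
the actual plugin code and does not use the AmiBroker plugin API.

```c
#include <assert.h>
#include <string.h>

/* All names here are hypothetical; this is not AmiBroker's plugin API. */
enum { MODE_SINGLE = 0, MODE_LOAD = 1, MODE_OPTIMIZE = 2, MODE_APPLY = 3 };
enum { MAX_SYMBOLS = 8, MAX_BARS = 16 };
enum { DEFAULT_PARAM = 5, PARAM_MIN = 1, PARAM_MAX = 6 };
enum { NO_TRADES = -1 }; /* tells the caller to zero buy/sell/short/cover */

static float staged[MAX_SYMBOLS][MAX_BARS]; /* staging area filled in mode 1 */
static int staged_count = 0;                /* symbols staged so far */
static int staged_bars = 0;                 /* bars per symbol (assumed aligned) */
static int best_param = DEFAULT_PARAM;      /* result of the last mode 2 run */

/* Toy stand-in for the trade matching and fitness code: hold for one bar
   whenever the close is below `param`, and sum the bar-to-bar changes. */
static float fitness(int param)
{
    float profit = 0.0f;
    for (int s = 0; s < staged_count; ++s)
        for (int i = 0; i + 1 < staged_bars; ++i)
            if (staged[s][i] < (float)param)
                profit += staged[s][i + 1] - staged[s][i];
    return profit;
}

/* Brute force over every parameter value; the first best score wins ties. */
static void optimize_staged(void)
{
    float best = fitness(PARAM_MIN);
    best_param = PARAM_MIN;
    for (int p = PARAM_MIN + 1; p <= PARAM_MAX; ++p) {
        float f = fitness(p);
        if (f > best) { best = f; best_param = p; }
    }
}

/* Called once per symbol per pass. Returns the parameter to trade with,
   or NO_TRADES when the signal arrays should be left at 0. */
int plugin_call(int mode, int symbol_index, const float *close, int nbars)
{
    assert(close != NULL && nbars > 0 && nbars <= MAX_BARS);
    switch (mode) {
    case MODE_LOAD:
        if (symbol_index == 0) staged_count = 0;   /* new load pass */
        if (staged_count < MAX_SYMBOLS) {
            memcpy(staged[staged_count++], close, (size_t)nbars * sizeof *close);
            staged_bars = nbars;
        }
        return NO_TRADES;
    case MODE_OPTIMIZE:
        if (symbol_index == 0) optimize_staged();  /* first symbol only */
        return NO_TRADES;
    case MODE_APPLY:
        return best_param;
    default: /* MODE_SINGLE */
        return DEFAULT_PARAM;
    }
}
```

The caller would invoke plugin_call once per symbol for each mode in
turn, the same way the optimizer steps the mode variable through 1, 2
and 3. Only the mode 3 pass returns a parameter that produces trades.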

The Status("action") and Status("ActionEx") codes could also be used,
but they did not tell me quite what I needed to know. Also, I could
have avoided the mode = 2 step if I could find a way to know I was on
the last symbol and run the optimization then. I guess I could pass
the name of the last symbol.

So I use AmiBroker to load and keep the database, visualize the
trades, validate, walkforward and provide deep metrics of the
backtest.

If I wanted to take this further, I would move the trade system logic
out of the Dll and make it programmable from AFL. That way it could
be used by anyone without needing to program in C. I would do this by
passing handles to CUDA arrays through the AFL code.

--- In amibroker@xxxxxxxxxps.com, "Paul Ho" <paul.tsho@x..> wrote:
>
> thanks for your insight.
> I hope you don't mind sharing a little bit more detail
> You said "
> > To get the best performance, my AFL code makes one pass over the
> > data, calling a Dll. The Dll takes all of the data needed by the
> > calculation and loads a copy to the video card. This upload is
> > slow, the entire upload takes about 45 seconds for all 1000 symbols.
> >
> > Once all of the data is uploaded, the Dll loads a "kernel" into the
> > graphics cores that perform the actual computation and generates
> > the trade list.
>
> normally AB loads the data from database as needed, and calls a
> function in a dll, and passes data in arrays or whatever as arguments
> of the function. The function will be called for every ticker in the
> watchlist, and data pertaining to that symbol is passed each time. I
> wonder how you do a "single pass" over the data. Because AB passes
> the data as part of the argument regardless of how many optimizations
> it had previously with the same data. I just wonder how you do it.
> cheers
> Paul.
>
> --- In amibroker@xxxxxxxxxps.com, "dloyer123" <dloyer123@> wrote:
> >
> > This uses the mid range video card that happened to come with my
> > system, a 9800GT. The newer 260 and 280 cards are 3 to 4 times
> > faster. The 260 can be found at Best Buy for $300. Some laptops
> > have compatible cards as well.
> >
> > The video card has its own memory, mine has 512MB, some have as
> > much as 1GB. This memory is very fast, once it is loaded from the
> > main system. Nvidia has a professional line of products that have
> > much more memory.
> >
> > To get the best performance, my AFL code makes one pass over the
> > data, calling a Dll. The Dll takes all of the data needed by the
> > calculation and loads a copy to the video card. This upload is
> > slow, the entire upload takes about 45 seconds for all 1000 symbols.
> >
> > Once all of the data is uploaded, the Dll loads a "kernel" into the
> > graphics cores that perform the actual computation and generates
> > the trade list. This part is very fast and performs all of the same
> > functions that my AFL version does. The resulting trade list is the
> > same.
> >
> > Because the data is loaded into video memory, it can be reused for
> > many passes over the data with different optimization values. So,
> > hundreds of combinations of optimization values can be tried per
> > second.
> >
> > For non optimization runs, the Dll just loads one symbol into video
> > memory and processes it. Counting the overhead of moving data to
> > the video card and extracting the trade list for a single symbol,
> > the result is similar to AFL code alone. This lets me test the code
> > and make sure it is correct.
> >
> > This approach works best when the data only needs to be loaded
> > once, then "reused" many times. It also works best when there is a
> > lot of data to work with.
> >
> > What is more interesting to me and what would be more useful for
> > others would be a general driver that requires no Dll changes to
> > modify the system. The performance would not be as good as hand
> > optimized code, but would still be much better than AFL code alone.
> > It would take trading system design to a whole new level. It would
> > provide enough performance to make working with intraday data as
> > easy as daily data is today.
> >
> > Writing such a driver would be hard, but I have already done some
> > prototypes and design work. I am tempted to do it for my own use.
> > If I made it available to others, supporting it would be a PITA.
> >
> > --- In amibroker@xxxxxxxxxps.com, "Paul Ho" <paul.tsho@> wrote:
> > >
> > > I'm very interested
> > > could you elaborate a bit more
> > > What model of Nvidia chipset are you using, and with how much
> > > memory?
> > > Not sure exactly what you mean when you say
> > > It uses AmiBroker to load the symbol data and perform calculations
> > > that do not depend on the optimization parameters. Once loaded
> > > into video memory, repeated passes can be made with different
> > > parameters, avoiding any overhead.
> > > Can you give me some examples. I presume when your dll is called,
> > > AB passes one or more arrays of data belonging to 1 symbol, is
> > > that true?
> > > Not sure exactly what the rest mean either. How many functions
> > > are you running in your dll, and what does each of them do?
> > > Great of you to share your insight.
> > > Cheers
> > > Paul.
> > >
> > > _____
> > >
> > > From: amibroker@xxxxxxxxxps.com [mailto:amibroker@xxxxxxxxxps.com]
> > > On Behalf Of dloyer123
> > > Sent: Tuesday, 5 August 2008 9:19 AM
> > > To: amibroker@xxxxxxxxxps.com
> > > Subject: [amibroker] Freakishly fast backtest using 64 cores
> > >
> > > Greetings,
> > >
> > > I ported part of my AFL backtest code to a plugin that takes
> > > advantage of the graphics math cores on the video card that are
> > > normally used for 3d graphics.
> > >
> > > I was able to get a several thousand fold performance improvement
> > > over AFL code alone.
> > >
> > > My goal was to reduce the 25 seconds AFL code alone uses for a
> > > single portfolio level back test to less than 1 second, allowing
> > > multi day optimization and walkforward runs to complete in a more
> > > reasonable time, and also just to see how fast I could get it to
> > > run.
> > >
> > > The backtest runs over 1 year of 5 minute bars for about 1000
> > > symbols. 1 year of data normally takes 25 seconds for AmiBroker
> > > alone, or 18 seconds for 6 months of data. A typical optimization
> > > run takes hundreds of these passes per walk forward step, taking
> > > hours.
> > >
> > > Using the Nvidia CUDA API, running on my mid range video card, it
> > > was much faster. Much, much, much faster. How fast?
> > >
> > > It reduced the run time from 25s to... 4.4ms. That is more than
> > > 200/s!
> > >
> > > I didn't believe the timing when I saw it at first. So, I put
> > > 1,000 runs in a loop and sure enough, it ran 1,000 iterations in
> > > about 4 1/2 seconds. This far exceeded my goal or expectations.
> > >
> > > The resulting trade list matches that obtained by the AFL version
> > > of this code.
> > >
> > > I estimate that it is processing 32GB of bar data/sec.
> > >
> > > Getting this to work at peak performance was tricky. Most of what
> > > I have learned about code optimization does not apply.
> > >
> > > It uses AmiBroker to load the symbol data and perform calculations
> > > that do not depend on the optimization parameters. Once loaded
> > > into video memory, repeated passes can be made with different
> > > parameters, avoiding any overhead.
> > >
> > > For non backtest/optimization runs, the code just evaluates one
> > > symbol and passes the data back to AmiBroker buy/sell/short/cover
> > > arrays, making it easy to test, validate and visualize the trades.
> > > There is very little performance gain in this case.
> > >
> > > There are problems, however. To run optimizations at peak speed,
> > > I can not use AmiBroker to calculate the optimization goal
> > > function. So, I am in the process of writing code to match signals
> > > and calculate the portfolio fitness function. Once I do this, I
> > > will be able to perform full optimizations and walk forwards at 3
> > > orders of magnitude faster than is possible with AmiBroker alone.
> > >
> > > Also, this is not general purpose code. Changing the system code
> > > means changing a dll written in C. However, there is no reason
> > > that this could not be made more general.
> > >
> > > I have made some prototypes of "Cuda" versions of basic AFL
> > > functions. The idea is to queue the function calls into a
> > > definition executed by a micro kernel running on the graphics
> > > cores. The result would be the ability to use the full power of
> > > the graphics cores by modifying AFL code to use Cuda aware
> > > versions with no changes to C code. It would be an interesting,
> > > but big project.
> > >