Which video cards provide this feature? As far as I can tell, it's only the
8-Series (G8X) GPU from NVIDIA, found in the GeForce, Quadro and Tesla
lines. Who will have one of these? Are many people likely to have them in
the future?
Thanks
----- Original Message -----
From: "dloyer123" <dloyer123@xxxxxxcom>
To: <amibroker@xxxxxxxxxps.com>
Sent: Wednesday, August 06, 2008 12:22 AM
Subject: [amibroker] Re: Freakishly fast backtest using 64 cores
> Very good question. That was a head scratcher.
>
> So the thing is, AmiBroker does a lot more work in an optimization
> pass than execute AFL code. In fact, the AFL code may take very
> little of the total run time.
>
> As an example, using a database with a good amount of data, write an
> AFL file that does nothing but set the buy/sell/short/cover arrays
> to 0. The backtest will still take a good bit of time.
>
> So, even reducing the AFL run time to zero is not enough. It will
> not help much at all.
>
> So, to avoid this, I pass a "mode" variable to my Dll. This mode is
> set by a simple optimization statement:
>
> mode = optimize("mode",0,1,3,1);
>
> When mode = 0, the dll will evaluate one symbol like a normal dll.
> So if I click on a bar, it will update my printf statements, etc.
> The buy/sell/short/cover arrays are set. A single backtest (not
> optimize) will use the normal AmiBroker trade match and evaluate code
> and generate stats as normal.
>
> When mode = 1, this means load the data. The Dll will copy the price
> data to a staging area in memory. buy/sell/short/cover are set to 0
> to generate no trades. Having AmiBroker align the symbol bars was a
> big help here.
>
> When mode = 2, on the first symbol and the first symbol only, it
> loads the price data to the video card and executes as many backtest
> passes as it needs at a few ms per pass. Once the best combination
> is found it returns. buy/sell/short/cover are set to 0. Note that I
> can not use the AmiBroker signal match and fitness function code. I
> have to provide my own. This is where the performance advantage of
> all of the extra cores comes into play. It may run hundreds or
> thousands of parameter combinations very quickly. I can't use the
> built in optimize support, but brute force is enough for now. After
> all, I get 200 combinations per second.
>
> When mode = 3, each symbol evaluates using the best parms found on
> the last mode=2 run. buy/sell/short/cover are set. In a walkforward
> test, this will always have the best score and be used for the
> walkforward step. A custom backtest function adds the chosen
> parameters to the backtest report. Mode 3 works like mode 0 except
> it uses the optimal parameters rather than default values.
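
For readers following along, the four modes described above could be dispatched in C roughly as follows. This is only a sketch under assumptions: every name in it (`plugin_state`, `handle_symbol`, `run_sweep_stub`) is made up for illustration, the sweep is a placeholder that returns a fixed value, and none of it is taken from the actual plugin.

```c
#include <stddef.h>

/* Hypothetical names throughout; this is not the plugin's actual code. */
enum run_mode {
    MODE_SINGLE   = 0,  /* evaluate one symbol with default parameters */
    MODE_LOAD     = 1,  /* stage this symbol's price data, emit no trades */
    MODE_OPTIMIZE = 2,  /* first symbol only: run the parameter sweep */
    MODE_APPLY    = 3   /* evaluate each symbol with the best parameters */
};

struct plugin_state {
    size_t staged_symbols;  /* symbols staged during the mode 1 pass */
    int    have_best;       /* nonzero once a mode 2 sweep has finished */
    int    best_param;      /* winner of the last mode 2 sweep */
};

/* Placeholder for the GPU sweep; a real one would score every combination. */
static int run_sweep_stub(const struct plugin_state *s)
{
    (void)s;
    return 14;  /* arbitrary illustrative value */
}

/*
 * Called once per symbol. Returns 1 when real buy/sell/short/cover signals
 * should be produced and 0 when the arrays should be zeroed. *param receives
 * the parameter value to evaluate with.
 */
int handle_symbol(struct plugin_state *s, int mode, size_t symbol_index,
                  int default_param, int *param)
{
    *param = default_param;
    switch (mode) {
    case MODE_LOAD:
        if (symbol_index == 0) {      /* a new pass starts: reset staging */
            s->staged_symbols = 0;
            s->have_best = 0;
        }
        s->staged_symbols++;          /* copying the bars would happen here */
        return 0;
    case MODE_OPTIMIZE:
        if (symbol_index == 0) {      /* first symbol only */
            s->best_param = run_sweep_stub(s);
            s->have_best = 1;
        }
        return 0;
    case MODE_APPLY:
        if (s->have_best)
            *param = s->best_param;
        return 1;
    case MODE_SINGLE:
    default:
        return 1;
    }
}
```

The caller would pass the value of the AFL "mode" variable on each call; modes 1 and 2 return 0 so that no trades are generated during those passes.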
>
> The Status("action") and Status("actionex") codes could also be used,
> but they did not tell me quite what I needed to know. Also, I could
> have avoided the mode=2 step if I could find a way to know I was on
> the last symbol and run the optimization then. I guess I could pass
> the name of the last symbol.
>
> So I use AmiBroker to load and keep the database, visualize the
> trades, validate, walkforward and provide deep metrics of the
> backtest.
>
> If I wanted to take this further, I would move the trade system logic
> out of the Dll and make it programmable from AFL. That way it could
> be used by anyone without needing to program C. I would do this by
> passing handles to CUDA arrays through the AFL code.
>
>
>
>
> --- In amibroker@xxxxxxxxxps.com, "Paul Ho" <paul.tsho@x..> wrote:
>>
>> Thanks for your insight.
>> I hope you don't mind sharing a little bit more detail.
>> You said:
>> > To get the best performance, my AFL code makes one pass over the
>> > data, calling a Dll. The Dll takes all of the data needed by the
>> > calculation and loads a copy to the video card. This upload is
>> > slow, the entire upload takes about 45 seconds for all 1000 symbols.
>> >
>> > Once all of the data is uploaded, the Dll loads a "kernel" into
>> > the graphics cores that performs the actual computation and
>> > generates the trade list.
>>
>> Normally AB loads the data from the database as needed, calls a
>> function in a dll, and passes data in arrays or whatever as arguments
>> of the function. The function will be called for every ticker in the
>> watchlist, and data pertaining to that symbol is passed each time. I
>> wonder how you do a "single pass" over the data, because AB passes
>> the data as part of the argument regardless of how many optimizations
>> it had previously run with the same data. I just wonder how you do it.
>> Cheers
>> Paul.
>>
>> --- In amibroker@xxxxxxxxxps.com, "dloyer123" <dloyer123@> wrote:
>> >
>> > This uses the mid range video card that happened to come with my
>> > system, a 9800GT. The newer 260 and 280 cards are 3 to 4 times
>> > faster. The 260 can be found at Best Buy for $300. Some laptops
>> > have compatible cards as well.
>> >
>> > The video card has its own memory, mine has 512MB, some have as
>> > much as 1GB. This memory is very fast, once it is loaded from the
>> > main system. Nvidia has a professional line of products that have
>> > much more memory.
>> >
>> > To get the best performance, my AFL code makes one pass over the
>> > data, calling a Dll. The Dll takes all of the data needed by the
>> > calculation and loads a copy to the video card. This upload is
>> > slow, the entire upload takes about 45 seconds for all 1000 symbols.
>> >
>> > Once all of the data is uploaded, the Dll loads a "kernel" into
>> > the graphics cores that performs the actual computation and
>> > generates the trade list. This part is very fast and performs all
>> > of the same functions that my AFL version does. The resulting
>> > trade list is the same.
>> >
>> > Because the data is loaded into video memory, it can be reused for
>> > many passes over the data with different optimization values. So,
>> > hundreds of combinations of optimization values can be tried per
>> > second.
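
The load-once, sweep-many-times pattern described above can be illustrated with a plain C stand-in that runs on the CPU. This is a sketch under loud assumptions: the scoring rule (`score_pass`, a toy "close above the close N bars ago" rule), the function names, and the single integer parameter are all invented for illustration. The real plugin keeps the data in video memory and scores each pass with its own trading logic and fitness function.

```c
#include <stddef.h>

/*
 * Toy fitness (illustrative only): sum the next-bar price change for every
 * bar whose close is above the close `lookback` bars earlier.
 */
static double score_pass(const double *close, size_t n, size_t lookback)
{
    double total = 0.0;
    for (size_t i = lookback; i + 1 < n; i++) {
        if (close[i] > close[i - lookback])
            total += close[i + 1] - close[i];
    }
    return total;
}

/*
 * Brute-force sweep: the staged `close` array is read by every pass and
 * never reloaded. Returns the best lookback in [min_lb, max_lb]; on ties
 * the smallest value wins. The winning score is written to *best_score
 * when that pointer is not NULL.
 */
size_t best_lookback(const double *close, size_t n,
                     size_t min_lb, size_t max_lb, double *best_score)
{
    size_t best = min_lb;
    double top = score_pass(close, n, min_lb);
    for (size_t lb = min_lb + 1; lb <= max_lb; lb++) {
        double s = score_pass(close, n, lb);
        if (s > top) {
            top = s;
            best = lb;
        }
    }
    if (best_score)
        *best_score = top;
    return best;
}
```

The point of the sketch is only that the cost of getting the data in place is paid once, while the inner loop over parameter values touches the same buffer repeatedly.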
>> >
>> > For non optimization runs, the Dll just loads one symbol into
>> > video memory and processes it. Counting the overhead of moving
>> > data to the video card and extracting the trade list for a single
>> > symbol, the run time is similar to AFL code alone. This lets me
>> > test the code and make sure it is correct.
>> >
>> > This approach works best when the data only needs to be loaded
>> > once, then reused many times. It also works best when there is a
>> > lot of data to work with.
>> >
>> > What is more interesting to me and what would be more useful for
>> > others would be a general driver that requires no Dll changes to
>> > modify the system. The performance would not be as good as hand
>> > optimized code, but would still be much better than AFL code
>> > alone. It would take trading system design to a whole new level.
>> > It would provide enough performance to make working with intraday
>> > data as easy as daily data is today.
>> >
>> > Writing such a driver would be hard, but I have already done some
>> > prototypes and design work. I am tempted to do it for my own use.
>> > If I made it available to others, supporting it would be a PITA.
>> >
>> >
>> >
>> >
>> > --- In amibroker@xxxxxxxxxps.com, "Paul Ho" <paul.tsho@> wrote:
>> > >
>> > > I'm very interested, could you elaborate a bit more?
>> > > What model of Nvidia chipset are you using, and with how much
>> > > memory?
>> > > Not sure exactly what you mean when you say
>> > > "It uses AmiBroker to load the symbol data and perform
>> > > calculations that do not depend on the optimization parameters.
>> > > Once loaded into video memory, repeated passes can be made with
>> > > different parameters, avoiding any overhead."
>> > > Can you give me some examples? I presume when your dll is called,
>> > > AB passes one or more arrays of data belonging to 1 symbol, is
>> > > that true?
>> > > Not sure exactly what the rest means either. How many functions
>> > > are you running in your dll, and what does each of them do?
>> > > Great of you to share your insight.
>> > > Cheers
>> > > Paul.
>> > >
>> > >
>> > >
>> > > _____
>> > >
>> > > From: amibroker@xxxxxxxxxps.com [mailto:amibroker@xxxxxxxxxps.com]
>> > > On Behalf Of dloyer123
>> > > Sent: Tuesday, 5 August 2008 9:19 AM
>> > > To: amibroker@xxxxxxxxxps.com
>> > > Subject: [amibroker] Freakishly fast backtest using 64 cores
>> > >
>> > > Greetings,
>> > >
>> > > I ported part of my AFL backtest code to a plugin that takes
>> > > advantage of the graphics math cores on the video card that are
>> > > normally used for 3d graphics.
>> > >
>> > > I was able to get a several thousand fold performance improvement
>> > > over AFL code alone.
>> > >
>> > > My goal was to reduce the 25 seconds AFL code alone uses for a
>> > > single portfolio level back test to less than 1 second, allowing
>> > > multi day optimization and walkforward runs to complete in a more
>> > > reasonable time, and also just to see how fast I could get it to
>> > > run.
>> > >
>> > > The backtest runs over 1 year of 5 minute bars for about 1000
>> > > symbols. 1 year of data normally takes 25 seconds for AmiBroker
>> > > alone, or 18 seconds for 6 months of data. A typical optimization
>> > > run takes hundreds of these passes per walk forward step, taking
>> > > hours.
>> > >
>> > > Using the Nvidia CUDA API, running on my mid range video card, it
>> > > was much faster. Much, much, much faster. How fast?
>> > >
>> > > It reduced the run time from 25s to... 4.4ms. That is more than
>> > > 200/s!
>> > >
>> > > I didn't believe the timing when I saw it at first. So, I put
>> > > 1,000 runs in a loop and sure enough, it ran 1,000 iterations in
>> > > about 4 1/2 seconds. This far exceeded my goal or expectations.
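
As a quick check of the figures quoted above, the arithmetic can be written out in C. The function names are made up for this note; the only inputs are the 25 s and 4.4 ms numbers from the post.

```c
/* Passes per second for a given time per pass (in seconds). */
double passes_per_second(double seconds_per_pass)
{
    return 1.0 / seconds_per_pass;
}

/* How many times faster the new run time is than the baseline. */
double speedup(double baseline_seconds, double new_seconds)
{
    return baseline_seconds / new_seconds;
}

/* Wall-clock time for a loop of `passes` runs. */
double loop_seconds(double passes, double seconds_per_pass)
{
    return passes * seconds_per_pass;
}
```

With 4.4 ms per pass this gives roughly 227 passes per second, a speedup of roughly 5,700 over the 25 s baseline, and about 4.4 s for 1,000 passes, which is consistent with the "more than 200/s", "several thousand fold" and "about 4 1/2 seconds" statements.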
>> > >
>> > > The resulting trade list matches that obtained by the AFL version
>> > > of this code.
>> > >
>> > > I estimate that it is processing 32GB of bar data/sec.
>> > >
>> > > Getting this to work at peak performance was tricky. Most of what
>> > > I have learned about code optimization does not apply.
>> > >
>> > > It uses AmiBroker to load the symbol data and perform
>> > > calculations that do not depend on the optimization parameters.
>> > > Once loaded into video memory, repeated passes can be made with
>> > > different parameters, avoiding any overhead.
>> > >
>> > > For non backtest/optimization runs, the code just evaluates one
>> > > symbol and passes the data back to AmiBroker buy/sell/short/cover
>> > > arrays, making it easy to test, validate and visualize the
>> > > trades. There is very little performance gain in this case.
>> > >
>> > > There are problems, however. To run optimizations at peak speed,
>> > > I can not use AmiBroker to calculate the optimization goal
>> > > function. So, I am in the process of writing code to match
>> > > signals and calculate the portfolio fitness function. Once I do
>> > > this, I will be able to perform full optimizations and walk
>> > > forwards 3 orders of magnitude faster than is possible with
>> > > AmiBroker alone.
>> > >
>> > > Also, this is not general purpose code. Changing the system code
>> > > means changing a dll written in C. However, there is no reason
>> > > that this could not be made more general.
>> > >
>> > > I have made some prototypes of "Cuda" versions of basic AFL
>> > > functions. The idea is to queue the function calls into a
>> > > definition executed by a micro kernel running on the graphics
>> > > cores. The result would be the ability to use the full power of
>> > > the graphics cores by modifying AFL code to use Cuda aware
>> > > versions with no changes to C code. It would be an interesting,
>> > > but big project.
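
One way to picture the "queue the function calls into a definition" idea is a tiny interpreter over array slots, shown here in plain C on the CPU. This is a guess at the general shape and not the author's prototype: the op codes, the `struct op` layout and `run_queue` are all hypothetical, and only element-wise operations are covered (anything that looks back at earlier bars would need a different execution order).

```c
#include <stddef.h>

/* Hypothetical element-wise operations that AFL-side wrappers might queue. */
enum op_code { OP_ADD, OP_SUB, OP_MUL, OP_GT };

/* One queued call: slots[dst] = slots[a] <op> slots[b], bar by bar. */
struct op {
    enum op_code code;
    int dst, a, b;      /* indices into the slot table */
};

/*
 * Executes the queued definition. The outer loop runs over bars; on a GPU
 * each bar would map to its own thread, which works here because every op
 * only reads and writes the current bar.
 */
void run_queue(const struct op *queue, size_t n_ops,
               double **slots, size_t n_bars)
{
    for (size_t i = 0; i < n_bars; i++) {
        for (size_t k = 0; k < n_ops; k++) {
            const double x = slots[queue[k].a][i];
            const double y = slots[queue[k].b][i];
            double r = 0.0;
            switch (queue[k].code) {
            case OP_ADD: r = x + y; break;
            case OP_SUB: r = x - y; break;
            case OP_MUL: r = x * y; break;
            case OP_GT:  r = (x > y) ? 1.0 : 0.0; break;
            }
            slots[queue[k].dst][i] = r;
        }
    }
}
```

A definition such as "diff = close - open; signal = diff > 0" would be queued as two ops and then run once per parameter combination, without recompiling any C code.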
>> >
>
>> >