Very good question. That was a head scratcher.
The thing is, AmiBroker does a lot more work in an optimization pass
than execute AFL code. In fact, the AFL code may take very little of
the total run time.
As an example, using a database with a good amount of data, write an
AFL file that does nothing but set the buy/sell/short/cover arrays to
0. The backtest will still take a good bit of time. So even reducing
the AFL run time to zero is not enough. It will not help much at all.
So, to avoid this, I pass a "mode" variable to my DLL. This mode is
set by a simple optimization statement:
mode = optimize("mode",0,1,3,1);
The default value is 0, and an optimization run steps mode through 1,
2 and 3.
When mode = 0, the DLL evaluates one symbol like a normal DLL. So if I
click on a bar, it will update my printf statements, etc. The
buy/sell/short/cover arrays are set. A single backtest (not an
optimization) uses the normal AmiBroker trade matching and evaluation
code and generates stats as normal.
When mode = 1, this means load the data. The DLL copies the price data
to a staging area in memory. buy/sell/short/cover are set to 0 to
generate no trades. Having AmiBroker align the symbol bars was a big
help here.
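
A minimal C sketch of what the mode 1 staging step could look like. It assumes every symbol has the same number of bars (which is what the bar alignment provides) and stages only closing prices. The Stage struct and the stage_* names are made up for this illustration; they are not the actual DLL code.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical staging area: closing prices for all symbols in one
   contiguous block, symbol after symbol. */
typedef struct {
    float  *close;     /* capacity * bars floats */
    size_t  bars;      /* bars per symbol, same for all (aligned) */
    size_t  count;     /* symbols staged so far */
    size_t  capacity;  /* symbols the block can currently hold */
} Stage;

void stage_init(Stage *s, size_t bars)
{
    assert(s != NULL && bars > 0);
    s->close = NULL;
    s->bars = bars;
    s->count = 0;
    s->capacity = 0;
}

/* Copy one symbol's bars into the stage. Returns 0 on success, -1 on
   a bar-count mismatch or an allocation failure. */
int stage_add_symbol(Stage *s, const float *close, size_t bars)
{
    assert(s != NULL && close != NULL);
    if (bars != s->bars)
        return -1;
    if (s->count == s->capacity) {
        size_t new_cap = s->capacity ? s->capacity * 2 : 16;
        float *p = (float *)realloc(s->close,
                                    new_cap * s->bars * sizeof *p);
        if (p == NULL)
            return -1;
        s->close = p;
        s->capacity = new_cap;
    }
    memcpy(s->close + s->count * s->bars, close, bars * sizeof *close);
    s->count++;
    return 0;
}

void stage_free(Stage *s)
{
    free(s->close);
    s->close = NULL;
    s->count = s->capacity = 0;
}
```

Keeping everything in one contiguous block means the later upload to the video card can be a single copy rather than one copy per symbol.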
When mode = 2, on the first symbol and the first symbol only, it loads
the price data to the video card and executes as many backtest passes
as it needs at a few ms per pass. Once the best combination is found,
it returns. buy/sell/short/cover are set to 0. Note that I cannot use
the AmiBroker signal matching and fitness function code. I have to
provide my own. This is where the performance advantage of all of the
extra cores comes into play. It may run hundreds or thousands of
parameter combinations very quickly. I can't use the built-in optimize
support, but brute force is enough for now. After all, I get 200
combinations per second.
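
To make the brute force idea concrete, here is a minimal sketch in plain C. It runs on the CPU over a single price series; on the video card the same loop would be spread across the graphics cores. The fitness function (net profit of a toy long-only moving average crossover) and all names are placeholders for illustration, not the actual trading system or AmiBroker's fitness code.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { int fast; int slow; double score; } Best;

/* Simple moving average of the len bars ending at index end. */
static double sma(const float *p, size_t end, int len)
{
    double sum = 0.0;
    for (int k = 0; k < len; k++)
        sum += p[end - (size_t)k];
    return sum / len;
}

/* Placeholder fitness: net profit of a long-only crossover system. */
double fitness(const float *close, size_t n, int fast, int slow)
{
    double profit = 0.0, entry = 0.0;
    int in_pos = 0;
    for (size_t i = (size_t)slow - 1; i < n; i++) {
        int want_long = sma(close, i, fast) > sma(close, i, slow);
        if (want_long && !in_pos) {
            entry = close[i];
            in_pos = 1;
        } else if (!want_long && in_pos) {
            profit += close[i] - entry;
            in_pos = 0;
        }
    }
    if (in_pos)                     /* close any open trade at the end */
        profit += close[n - 1] - entry;
    return profit;
}

/* Try every (fast, slow) pair in the given ranges, keep the best. */
Best brute_force(const float *close, size_t n,
                 int fast_min, int fast_max, int slow_min, int slow_max)
{
    assert(close != NULL && n > 0 && fast_min >= 1);
    Best best = {0, 0, -1.0e300};
    for (int fast = fast_min; fast <= fast_max; fast++) {
        for (int slow = slow_min; slow <= slow_max; slow++) {
            if (slow <= fast || (size_t)slow > n)
                continue;           /* skip meaningless combinations */
            double score = fitness(close, n, fast, slow);
            if (score > best.score) {
                best.fast = fast;
                best.slow = slow;
                best.score = score;
            }
        }
    }
    return best;
}
```

Each combination is independent of the others, which is why this kind of search maps well onto many cores.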
When mode = 3, each symbol is evaluated using the best parameters
found on the last mode=2 run. buy/sell/short/cover are set. Because
modes 1 and 2 generate no trades, in a walkforward test this pass will
always have the best score and be used for the walkforward step. A
custom backtest function adds the chosen parameters to the backtest
report. Mode 3 works like mode 0 except it uses the optimal parameters
rather than default values.
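
The four modes can be summarized as a small dispatcher. This is only a sketch in C under stated assumptions: the DllState and Result structs, the symbol_index argument, DEFAULT_PARAM and the run_search stub are all hypothetical names for illustration, and the real staging and GPU work are left out.

```c
#include <assert.h>
#include <stddef.h>

typedef enum {
    MODE_SINGLE = 0,  /* normal one-symbol evaluation, default params */
    MODE_LOAD   = 1,  /* stage price data, emit no trades */
    MODE_SEARCH = 2,  /* run the parameter search once, emit no trades */
    MODE_APPLY  = 3   /* evaluate with the best params found */
} Mode;

typedef struct {
    int symbols_staged;
    int search_done;
    int best_param;
} DllState;

typedef struct {
    int emit_signals; /* 1: fill buy/sell/short/cover, 0: leave at 0 */
    int param;        /* parameter value to evaluate with */
} Result;

#define DEFAULT_PARAM 10

/* Stub standing in for the GPU search; returns a fixed placeholder. */
static int run_search(const DllState *st)
{
    (void)st;
    return 42;
}

/* Called once per symbol per pass, with the mode value from AFL. */
Result on_symbol(DllState *st, int mode, int symbol_index)
{
    assert(st != NULL);
    Result r = {0, DEFAULT_PARAM};
    switch (mode) {
    case MODE_SINGLE:
        r.emit_signals = 1;
        break;
    case MODE_LOAD:
        if (symbol_index == 0) {      /* new load pass: start over */
            st->symbols_staged = 0;
            st->search_done = 0;
        }
        st->symbols_staged++;         /* real code copies bars here */
        break;
    case MODE_SEARCH:
        if (symbol_index == 0 && !st->search_done) {
            st->best_param = run_search(st);
            st->search_done = 1;
        }
        break;
    case MODE_APPLY:
        r.emit_signals = 1;
        r.param = st->best_param;
        break;
    default:
        break;
    }
    return r;
}
```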
The Status("action") and Status("actionex") codes could also be used,
but they did not tell me quite what I needed to know. Also, I could
have avoided the mode=2 step if I could find a way to know I was on
the last symbol and run the optimization then. I guess I could pass
the name of the last symbol.
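
If the last symbol's name were passed in, the check on the DLL side could be as small as the following C sketch. The function name and the idea that both names arrive as plain C strings are assumptions for illustration only.

```c
#include <assert.h>
#include <string.h>

/* Returns 1 when the symbol being processed is the one the caller
   named as last in the watchlist, 0 otherwise. Hypothetical helper. */
int is_last_symbol(const char *current, const char *last)
{
    if (current == NULL || last == NULL)
        return 0;
    return strcmp(current, last) == 0;
}
```

With a check like this, the load pass could start the search as soon as the final symbol has been staged, making the separate mode=2 pass unnecessary.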
So I use AmiBroker to load and keep the database, visualize the
trades, validate, walkforward and provide deep metrics of the
backtest.
If I wanted to take this further, I would move the trading system
logic out of the DLL and make it programmable from AFL. That way it
could be used by anyone without needing to program in C. I would do
this by passing handles to CUDA arrays through the AFL code.
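
One way the handle idea could work is a small table inside the DLL that maps a numeric handle to a buffer, so the AFL side only ever holds a number. The C sketch below uses ordinary host memory as a stand-in for CUDA device memory, and every name in it is hypothetical. Handles are returned as float on the assumption that the AFL side would carry them as ordinary numeric values; small integers are represented exactly in a float.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_HANDLES 64

/* Slot 0 is reserved so that 0 can mean "invalid handle". */
static float  *g_arrays[MAX_HANDLES];
static size_t  g_sizes[MAX_HANDLES];

/* Allocate an array of n floats; returns a handle >= 1, or 0 on
   failure. A real version would allocate device memory instead. */
float array_alloc(size_t n)
{
    assert(n > 0);
    for (int h = 1; h < MAX_HANDLES; h++) {
        if (g_arrays[h] == NULL) {
            g_arrays[h] = (float *)calloc(n, sizeof(float));
            if (g_arrays[h] == NULL)
                return 0.0f;
            g_sizes[h] = n;
            return (float)h;
        }
    }
    return 0.0f;                      /* table full */
}

/* Look up a handle; returns NULL if it is not a live handle. */
float *array_from_handle(float handle, size_t *n)
{
    int h = (int)handle;
    if (h < 1 || h >= MAX_HANDLES || g_arrays[h] == NULL)
        return NULL;
    if (n != NULL)
        *n = g_sizes[h];
    return g_arrays[h];
}

void array_free(float handle)
{
    int h = (int)handle;
    if (h < 1 || h >= MAX_HANDLES)
        return;
    free(g_arrays[h]);
    g_arrays[h] = NULL;
    g_sizes[h] = 0;
}
```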
  
--- In amibroker@xxxxxxxxxps.com, "Paul Ho" <paul.tsho@x..> wrote:
>
> thanks for your insight.
> I hope you don't mind sharing a little bit more detail
> You said "
> > To get the best performance, my AFL code makes one pass over the
> > data, calling a Dll. The Dll takes all of the data needed by the
> > calculation and loads a copy to the video card. This upload is
> > slow, the entire upload takes about 45 seconds for all 1000
> > symbols.
> >
> > Once all of the data is uploaded, the Dll loads a "kernel" into
> > the graphics cores that perform the actual computation and
> > generates the trade list.
>
> normally AB loads the data from database as needed, and calls a
> function in a dll, and passes data in arrays or whatever as
> arguments of the function. The function will be called for every
> ticker in the watchlist, and data pertaining to that symbol is
> passed each time. I wonder how you do a "single pass" over the data.
> Because AB passes the data as part of the argument regardless of how
> many optimizations it had previously with the same data. I just
> wonder how you do it.
> cheers
> Paul.
>
> --- In amibroker@xxxxxxxxxps.com, "dloyer123" <dloyer123@> wrote:
> >
> > This uses the mid range video card that happened to come with my
> > system, a 9800GT. The newer 260 and 280 cards are 3 to 4 times
> > faster. The 260 can be found at best buy for $300. Some laptops
> > have compatible cards as well.
> >
> > The video card has its own memory, mine has 512MB, some have as
> > much as 1GB. This memory is very fast, once it is loaded from the
> > main system. Nvidia has a professional line of products that have
> > much more memory.
> >
> > To get the best performance, my AFL code makes one pass over the
> > data, calling a Dll. The Dll takes all of the data needed by the
> > calculation and loads a copy to the video card. This upload is
> > slow, the entire upload takes about 45 seconds for all 1000
> > symbols.
> >
> > Once all of the data is uploaded, the Dll loads a "kernel" into
> > the graphics cores that perform the actual computation and
> > generates the trade list. This part is very fast and performs all
> > of the same functions that my AFL version does. The resulting
> > trade list is the same.
> >
> > Because the data is loaded into video memory, it can be reused for
> > many passes over the data with different optimization values. So,
> > hundreds of combinations of optimization values can be tried per
> > second.
> >
> > For non optimization runs, the Dll just loads one symbol into
> > video memory and processes it. Counting the overhead of moving
> > data to the video card and extracting the trade list for a single
> > symbol, the result is similar to AFL code alone. This lets me test
> > the code and make sure it is correct.
> >
> > This approach works best when the data only needs to be loaded
> > once, then "reused" many times. It also works best when there is a
> > lot of data to work with.
> >
> > What is more interesting to me and what would be more useful for
> > others would be a general driver that requires no Dll changes to
> > modify the system. The performance would not be as good as hand
> > optimized code, but would still be much better than AFL code
> > alone. It would take trading system design to a whole new level.
> > It would provide enough performance to make working with intraday
> > data as easy as daily data is today.
> >
> > Writing such a driver would be hard, but I have already done some
> > prototypes and design work. I am tempted to do it for my own use.
> > If I made it available to others, supporting it would be a PITA.
> >
> > --- In amibroker@xxxxxxxxxps.com, "Paul Ho" <paul.tsho@> wrote:
> > >
> > > I'm very interested
> > > could you elaborate a bit more
> > > What model of Nvidia chipset are you using, and with how much
> > > memory?
> > > Not sure exactly what you mean when you say
> > > It uses AmiBroker to load the symbol data and perform
> > > calculations that do not depend on the optimization parameters.
> > > Once loaded into video memory, repeated passes can be made with
> > > different parameters, avoiding any overhead.
> > > Can you give me some examples. I presume when your dll is
> > > called, AB passes one or more arrays of data belonging to 1
> > > symbol, is that true?
> > > Not sure exactly what the rest mean either. How many functions
> > > are you running in your dll, and what does each of them do?
> > > Great of you to share your insight.
> > > Cheers
> > > Paul.
> > >
> > > _____
> > >
> > > From: amibroker@xxxxxxxxxps.com
> > > [mailto:amibroker@xxxxxxxxxps.com] On Behalf Of dloyer123
> > > Sent: Tuesday, 5 August 2008 9:19 AM
> > > To: amibroker@xxxxxxxxxps.com
> > > Subject: [amibroker] Freakishly fast backtest using 64 cores
> > >
> > > Greetings,
> > >
> > > I ported part of my AFL backtest code to a plugin that takes
> > > advantage of the graphics math cores on the video card that are
> > > normally used for 3d graphics.
> > >
> > > I was able to get a several thousand fold performance
> > > improvement over AFL code alone.
> > >
> > > My goal was to reduce the 25 seconds AFL code alone uses for a
> > > single portfolio level back test to less than 1 second, allowing
> > > multi day optimization and walkforward runs to complete in a
> > > more reasonable time, and also just to see how fast I could get
> > > it to run.
> > >
> > > The backtest runs over 1 year of 5 minute bars for about 1000
> > > symbols. 1 year of data normally takes 25 seconds for AmiBroker
> > > alone, or 18 seconds for 6 months of data. A typical
> > > optimization run takes hundreds of these passes per walk forward
> > > step, taking hours.
> > >
> > > Using the Nvidia CUDA API, running on my mid range video card,
> > > it was much faster. Much, much, much faster. How fast?
> > >
> > > It reduced the run time from 25s to... 4.4ms. That is more than
> > > 200/s!
> > >
> > > I didn't believe the timing when I saw it at first. So, I put
> > > 1,000 runs in a loop and sure enough, it ran 1,000 iterations in
> > > about 4 1/2 seconds. This far exceeded my goal or expectations.
> > >
> > > The resulting trade list matches that obtained by the AFL
> > > version of this code.
> > >
> > > I estimate that it is processing 32GB of bar data/sec.
> > >
> > > Getting this to work at peak performance was tricky. Most of
> > > what I have learned about code optimization does not apply.
> > >
> > > It uses AmiBroker to load the symbol data and perform
> > > calculations that do not depend on the optimization parameters.
> > > Once loaded into video memory, repeated passes can be made with
> > > different parameters, avoiding any overhead.
> > >
> > > For non backtest/optimization runs, the code just evaluates one
> > > symbol and passes the data back to AmiBroker
> > > buy/sell/short/cover arrays, making it easy to test, validate
> > > and visualize the trades. There is very little performance gain
> > > in this case.
> > >
> > > There are problems, however. To run optimizations at peak speed,
> > > I cannot use AmiBroker to calculate the optimization goal
> > > function. So, I am in the process of writing code to match
> > > signals and calculate the portfolio fitness function. Once I do
> > > this, I will be able to perform full optimizations and walk
> > > forwards at 3 orders of magnitude faster than is possible with
> > > AmiBroker alone.
> > >
> > > Also, this is not general purpose code. Changing the system code
> > > means changing a dll written in C. However, there is no reason
> > > that this could not be made more general.
> > >
> > > I have made some prototypes of "Cuda" versions of basic AFL
> > > functions. The idea is to queue the function calls into a
> > > definition executed by a micro kernel running on the graphics
> > > cores. The result would be the ability to use the full power of
> > > the graphics cores by modifying AFL code to use Cuda aware
> > > versions with no changes to C code. It would be an interesting,
> > > but big project.
> > >
> >
>