
[amibroker] Re: Freakishly fast backtest using 64 cores




Using array processing for this problem rather than the more 
traditional for/next loop was a really good idea.  That part of 
the system is very fast at what it does and provides a great amount 
of freedom and flexibility.  

However, consider the trivial system:

Buy = 0;
Sell = 0;
Short = 0;
Cover = 0;
Optimize("val",0,1,10,1);

Clearly the AFL engine is being invoked, but this could be considered 
the fastest possible AFL code, or very close to it.  Its execution 
time is not zero, but pretty darn close.  The "check AFL" function 
measures 0.3 ms for 64,000 bars for the "Optimize" statement.

On my 3 GHz Core 2, this system takes 18 seconds to backtest over a 
portfolio of 850 symbols with 1 year of 5-minute bars.  This time does 
not vary much with the size of the test window.  Since "Quick AFL" is 
selected, there should be about 20k bars per symbol.  

Running an optimization takes roughly the same time as the backtest 
for each and every pass: 18 seconds, every time.  That 18 seconds 
cannot be explained by the AFL code execution alone.  There is other 
stuff being done that takes much, much, much longer.  

So, even if AFL execution runs in zero time, this is the limit of how 
fast AmiBroker can optimize.
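
A rough back-of-envelope check of the figures above, written as a small 
Python sketch rather than AFL.  It assumes AFL execution time scales 
linearly with bar count, which is an assumption and not something 
measured here; the function names are made up for illustration.

```python
# Back-of-envelope estimate of how much of the 18 s pass is AFL execution.
# Assumption (not measured): AFL time scales linearly with the bar count.

AFL_MS_PER_REFERENCE_RUN = 0.3    # "check AFL" timing quoted above
REFERENCE_BARS = 64_000           # bars covered by that timing
SYMBOLS = 850
BARS_PER_SYMBOL = 20_000          # approximate, with "Quick AFL" selected
PASS_SECONDS = 18.0               # measured time for one backtest pass


def estimated_afl_seconds(symbols: int, bars_per_symbol: int) -> float:
    """Scale the reference timing up to the whole portfolio."""
    total_bars = symbols * bars_per_symbol
    return AFL_MS_PER_REFERENCE_RUN / 1000.0 * total_bars / REFERENCE_BARS


def optimization_floor_seconds(passes: int, afl_seconds: float) -> float:
    """Total time if AFL ran in zero time, so only the overhead remains."""
    overhead_per_pass = PASS_SECONDS - afl_seconds
    return passes * overhead_per_pass


afl = estimated_afl_seconds(SYMBOLS, BARS_PER_SYMBOL)
print(f"estimated AFL time per pass: {afl:.3f} s")
print(f"share of an 18 s pass:       {afl / PASS_SECONDS:.2%}")
print(f"floor for 100 passes:        {optimization_floor_seconds(100, afl):.0f} s")
```

With the figures quoted above this comes to roughly 0.08 s of AFL time 
per pass, well under 1% of the 18 seconds.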

So, yes, the AFL execution is highly optimized and very fast, but 
there is a lot of overhead that is outside of the AFL execution.  I 
could guess what it is doing, but it really does not matter.  

I am only pointing out that it is a very juicy opportunity for 
meaningful performance gains.  

Yes, there are memory management issues, and yes, some data sets may 
be too large to take advantage of it, but a large fraction of your 
customer base would benefit.  It could even be made transparent to the 
user with no extra checkboxes.  No exotic hardware required.  It would 
even work on a laptop.  

When I run my system on the emulator, it just runs on the normal CPU 
core, using normal system memory.  If anything, there has to be a lot 
of overhead in pretending to run as so many threads, and the data set 
is far larger than the L2 cache.  But it is still much faster than 
the built-in backtest.  Yes, part of that is hand-optimized code, but 
that does not explain the performance differential of 20x to 50x 
between Ami and the emulator.  Running on the GPU is more like 4000x.  
Yes, the GPU has more memory bandwidth to work with, but not that much 
more.  

I would say that the AFL execution code is highly optimized and fully 
exploits the hardware it has to work with, but that there are 
performance bottlenecks elsewhere in the critical path.  I cannot 
tell you what they are, but I would guess that it is rebuilding price 
arrays and maybe other data structures on every pass.
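
A toy sketch of that guess, in Python rather than AFL: it contrasts 
rebuilding parameter-independent data on every optimization pass with 
building it once and reusing it.  Everything here (`PriceSource`, 
`run_pass`, the moving-average rule) is a hypothetical stand-in and says 
nothing about how AmiBroker is actually implemented.

```python
# Toy illustration of hoisting parameter-independent work out of the
# optimization loop.  All names are hypothetical; this does not describe
# AmiBroker's internals.


class PriceSource:
    """Stand-in for whatever builds a symbol's price array (the costly step)."""

    def __init__(self) -> None:
        self.loads = 0  # how many times an array was (re)built

    def load(self, symbol: str) -> list:
        self.loads += 1
        seed = sum(ord(ch) for ch in symbol)
        # Deterministic fake prices so both strategies see identical data.
        return [100.0 + ((seed + 7 * i) % 13) for i in range(50)]


def run_pass(prices: list, period: int) -> int:
    """Cheap per-pass logic: count bars closing above the prior simple average."""
    signals = 0
    for i in range(period, len(prices)):
        average = sum(prices[i - period:i]) / period
        if prices[i] > average:
            signals += 1
    return signals


def optimize_rebuilding(source, symbols, periods):
    """Rebuild every price array on every pass."""
    return {p: sum(run_pass(source.load(s), p) for s in symbols) for p in periods}


def optimize_cached(source, symbols, periods):
    """Build each price array once, then reuse it for all passes."""
    cache = {s: source.load(s) for s in symbols}
    return {p: sum(run_pass(cache[s], p) for s in symbols) for p in periods}


symbols = ["AAA", "BBB", "CCC"]
periods = [2, 3, 5, 8]

rebuilding_source = PriceSource()
slow_result = optimize_rebuilding(rebuilding_source, symbols, periods)

cached_source = PriceSource()
fast_result = optimize_cached(cached_source, symbols, periods)

print("loads when rebuilding:", rebuilding_source.loads)
print("loads when cached:    ", cached_source.loads)
print("same results:         ", slow_result == fast_result)
```

In this toy the rebuilding version builds 12 arrays (3 symbols x 4 
passes) and the cached version builds 3, with identical results.  The 
price is holding every array in memory at once, which is the memory 
versus speed trade-off under discussion.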

Anyway, I don't mean to tell you your business and you are much closer 
to this problem than I am.  Maybe there is some edge case that I have 
not considered that forces a performance hit.  It is still way faster 
than EasyLanguage.  

I am a big fan of your work and enjoy using your product.  The 
passion that you put into it shows.  



--- In amibroker@xxxxxxxxxxxxxxx, "Tomasz Janeczko" <groups@xxx> 
wrote:
>
> Hello,
> 
> What is true for the GPU is not necessarily true for the CPU. The GPU
> has a dedicated wide RAM bus and faster RAM, as opposed to system
> memory.
> 
> AmiBroker does a lot to utilise memory to the maximum extent where
> possible/feasible.
> 
> Actually, AFL speed is limited by system memory if you run out of
> on-chip cache.
> http://www.amibroker.com/kb/2008/08/12/afl-execution-speed/
> 
> So going for more memory usage does not always mean faster execution.
> 
> Sure, you can pre-compute everything and use pre-computed values, but
> you need to understand that people are doing VERY different things
> with AmiBroker, and their problems are not the same as the problems
> you are trying to solve.
> For example, some customers are backtesting the entire US stock
> universe (8000+ symbols) over 10 or 20 years. That's about 1.3GB for
> DATA alone. Now, if you are running a portfolio backtest you need to
> keep trading signals, and that can be as much as 1GB in such a case.
> Quickly you are reaching the 3GB RAM limit of a 32-bit OS. There is
> no place to store "pre-computed" values.
> AmiBroker by nature needs to provide the best blend of speed and
> moderate memory / CPU requirements.
> User-specific single-task solutions may go into specialisation and
> tricks that are not feasible for a commercial general-purpose product
> that is intended to keep a large user base happy.
> 
> Best regards,
> Tomasz Janeczko
> amibroker.com
> ----- Original Message ----- 
> From: "dloyer123" <dloyer123@xxx>
> To: <amibroker@xxxxxxxxxxxxxxx>
> Sent: Tuesday, August 12, 2008 4:09 PM
> Subject: [amibroker] Re: Freakishly fast backtest using 64 cores
> 
> 
> > The programming guide lists the 8600M and 8700M as having 32
> > computing cores.  Not sure what they are clocked at.  Power is an
> > issue.  The desktop versions need dedicated power connectors.  The
> > big cards need two.
> > 
> > Actually, when I am doing development on my laptop, I just use the
> > emulator.  It is about 100x slower than my desktop system, but still
> > about 20x to 50x faster than Ami alone.  The speed difference in
> > emulation mode is mostly due to the precomputed and cached price
> > arrays.
> > 
> > Tomasz:  I suspect that there is an opportunity to trade memory for
> > speed, even with 1 core.  Memory is cheap and would be a simpler way
> > to get a performance boost than porting to multi core, GPU or CPU.
> > 
> > 
> > 
> > --- In amibroker@xxxxxxxxxxxxxxx, "Tomasz Janeczko" <groups@> 
> > wrote:
> >>
> >> > Dell has 3 off the shelf
> >> > laptops in their entertainment/performance range that use
> >> > GeForce 8600M and 8700M with 256MB & 2*2456MB (min 256 required
> >> > for CUDA?)
> >> 
> >> Mobile ones are very poor cousins. Believe me. I own a brand new
> >> notebook (ASUS) with a GeForce 8600M and it is SLOW in 3D. I mean
> >> SLOW. Did I mention that it is SLOW?
> >> 
> >> In 3D Mark it gets the same results as my 3 year old desktop
> >> 6600GT.
> >> 
> >> Best regards,
> >> Tomasz Janeczko
> >> amibroker.com
> >> ----- Original Message ----- 
> >> From: "brian_z111" <brian_z111@>
> >> To: <amibroker@xxxxxxxxxxxxxxx>
> >> Sent: Tuesday, August 12, 2008 12:40 AM
> >> Subject: [amibroker] Re: Freakishly fast backtest using 64 cores
> >> 
> >> 
> >> > DL
> >> > 
> >> > 
> >> > I am following at the top level and understand what you are doing
> >> > OK (you make me wish I had learnt programming/IT).
> >> > 
> >> > I like your CPU.
> >> > 
> >> > Allowing niche trading is what AB is all about?
> >> > 
> >> > I'll put my money on MS/"general purpose computing on GPU" - I
> >> > don't think the masses are in love with MS but for 80% of people
> >> > who can do 80% of what they want with MS the price to move
> >> > elsewhere is too high - they are just in love with max output for
> >> > min input.
> >> > 
> >> > If you go to the trouble to write a plug-in do you think it will
> >> > be around long/require much ongoing support from you?
> >> > 
> >> > I can see the benefits of the speed - for a group of traders it
> >> > is a definite edge they would have for a year or two (I don't
> >> > think any other trading software will be seeing this for a
> >> > while?) - especially in the AT area where more crunching could be
> >> > done fast enough to keep up with live data.
> >> > 
> >> > I don't blame Tomasz for not sitting his backside on the cutting
> >> > edge - too dangerous for developers with long term clientele.
> >> > 
> >> > Not having a go at Tomasz - to clarify - Tomasz said GeForce 8800
> >> > can't be put in a notebook?
> >> > 
> >> > To my understanding there seems to be a reasonable number of
> >> > laptops around that could use your method e.g. Dell has 3 off the
> >> > shelf laptops in their entertainment/performance range that use
> >> > GeForce 8600M and 8700M with 256MB & 2*2456MB (min 256 required
> >> > for CUDA?)
> >> > 
> >> > I looked at the GeF links in Paul's post but they didn't have
> >> > much specific info there that I could see - I assume the above
> >> > cards will run your system.
> >> > 
> >> > I am not a buyer for now but good luck with it and what you have
> >> > done already is a good contribution to AB - once someone on the
> >> > block has a new super-dooper gadget pretty soon the neighbours
> >> > want one too and demand grows.
> >> > 
> >> > brian_z
> >> > 
> >> > 
> >> > 
> >> > --- In amibroker@xxxxxxxxxxxxxxx, "dloyer123" <dloyer123@> 
wrote:
> >> >>
> >> >> This uses the mid range video card that happened to come with
> >> >> my system, a 9800GT.  The newer 260 and 280 cards are 3 to 4
> >> >> times faster.  The 260 can be found at Best Buy for $300.  Some
> >> >> laptops have compatible cards as well.
> >> >> 
> >> >> The video card has its own memory, mine has 512MB, some have as
> >> >> much as 1GB.  This memory is very fast, once it is loaded from
> >> >> the main system.  Nvidia has a professional line of products
> >> >> that have much more memory.
> >> >> 
> >> >> To get the best performance, my AFL code makes one pass over the
> >> >> data, calling a Dll.  The Dll takes all of the data needed by
> >> >> the calculation and loads a copy to the video card.  This upload
> >> >> is slow; the entire upload takes about 45 seconds for all 1000
> >> >> symbols.
> >> >> 
> >> >> Once all of the data is uploaded, the Dll loads a "kernel" into
> >> >> the graphics cores that performs the actual computation and
> >> >> generates the trade list.  This part is very fast and performs
> >> >> all of the same functions that my AFL version does.  The
> >> >> resulting trade list is the same.
> >> >> 
> >> >> Because the data is loaded into video memory, it can be reused
> >> >> for many passes over the data with different optimization
> >> >> values.  So, hundreds of combinations of optimization values can
> >> >> be tried per second.
> >> >> 
> >> >> For non optimization runs, the Dll just loads one symbol into
> >> >> video memory and processes it.  Counting the overhead of moving
> >> >> data to the video card and extracting the trade list for a
> >> >> single symbol, the result is similar to AFL code alone.  This
> >> >> lets me test the code and make sure it is correct.
> >> >> 
> >> >> This approach works best when the data only needs to be loaded
> >> >> once, then "reused" many times.  It also works best when there
> >> >> is a lot of data to work with.
> >> >> 
> >> >> What is more interesting to me and what would be more useful for
> >> >> others would be a general driver that requires no Dll changes to
> >> >> modify the system.  The performance would not be as good as hand
> >> >> optimized code, but would still be much better than AFL code
> >> >> alone.  It would take trading system design to a whole new
> >> >> level.  It would provide enough performance to make working with
> >> >> intraday data as easy as daily data is today.
> >> >> 
> >> >> Writing such a driver would be hard, but I have already done
> >> >> some prototypes and design work.  I am tempted to do it for my
> >> >> own use.  If I made it available to others supporting it would
> >> >> be a PITA.
> >> >> 
> >> >> 
> >> >> 
> >> >> 
> >> >> --- In amibroker@xxxxxxxxxxxxxxx, "Paul Ho" <paul.tsho@> 
wrote:
> >> >> >
> >> >> > I'm very interested
> >> >> > could you elaborate a bit more
> >> >> > What model of Nvidia chipset are you using, and with how much
> >> >> > memory?
> >> >> > Not sure exactly what you mean when you say
> >> >> > It uses AmiBroker to load the symbol data and perform
> >> >> > calculations that do not depend on the optimization
> >> >> > parameters. Once loaded into video memory, repeated passes can
> >> >> > be made with different parameters, avoiding any overhead.
> >> >> > Can you give me some examples. I presume when your dll is
> >> >> > called, AB passes one or more arrays of data belonging to 1
> >> >> > symbol, is that true?
> >> >> > Not sure exactly what the rest mean either. How many functions
> >> >> > are you running in your dll, and what does each of them do?
> >> >> > Great of you to share your insight.
> >> >> > Cheers
> >> >> > Paul.
> >> >> >  
> >> >> > 
> >> >> > 
> >> >> >   _____  
> >> >> > 
> >> >> > From: amibroker@xxxxxxxxxxxxxxx
> >> >> > [mailto:amibroker@xxxxxxxxxxxxxxx] On Behalf Of dloyer123
> >> >> > Sent: Tuesday, 5 August 2008 9:19 AM
> >> >> > To: amibroker@xxxxxxxxxxxxxxx
> >> >> > Subject: [amibroker] Freakishly fast backtest using 64 cores
> >> >> > 
> >> >> > 
> >> >> > 
> >> >> > Greetings,
> >> >> > 
> >> >> > I ported part of my AFL backtest code to a plugin that takes
> >> >> > advantage of the graphics math cores on the video card that
> >> >> > are normally used for 3d graphics.
> >> >> > 
> >> >> > I was able to get a several thousand fold performance
> >> >> > improvement over AFL code alone.
> >> >> > 
> >> >> > My goal was to reduce the 25 seconds AFL code alone uses for a
> >> >> > single portfolio level back test to less than 1 second,
> >> >> > allowing multi day optimization and walkforward runs to
> >> >> > complete in a more reasonable time, and also just to see how
> >> >> > fast I could get it to run.
> >> >> > 
> >> >> > The backtest runs over 1 year of 5 minute bars for about 1000
> >> >> > symbols. 1 year of data normally takes 25 seconds for
> >> >> > AmiBroker alone, or 18 seconds for 6 months of data. A typical
> >> >> > optimization run takes hundreds of these passes per walk
> >> >> > forward step, taking hours.
> >> >> > 
> >> >> > Using the Nvidia CUDA API, running on my mid range video card,
> >> >> > it was much faster. Much, much, much faster. How fast?
> >> >> > 
> >> >> > It reduced the run time from 25s to... 4.4ms. That is more
> >> >> > than 200/s!
> >> >> > 
> >> >> > I didn't believe the timing when I saw it at first. So, I put
> >> >> > 1,000 runs in a loop and sure enough, it ran 1,000 iterations
> >> >> > in about 4 1/2 seconds. This far exceeded my goal or
> >> >> > expectations.
> >> >> > 
> >> >> > The resulting trade list matches that obtained by the AFL
> >> >> > version of this code.
> >> >> > 
> >> >> > I estimate that it is processing 32GB of bar data/sec.
> >> >> > 
> >> >> > Getting this to work at peak performance was tricky. Most of
> >> >> > what I have learned about code optimization does not apply.
> >> >> > 
> >> >> > It uses AmiBroker to load the symbol data and perform
> >> >> > calculations that do not depend on the optimization
> >> >> > parameters. Once loaded into video memory, repeated passes can
> >> >> > be made with different parameters, avoiding any overhead.
> >> >> > 
> >> >> > For non backtest/optimization runs, the code just evaluates
> >> >> > one symbol and passes the data back to AmiBroker
> >> >> > buy/sell/short/cover arrays, making it easy to test, validate
> >> >> > and visualize the trades. There is very little performance
> >> >> > gain in this case.
> >> >> > 
> >> >> > There are problems, however. To run optimizations at peak
> >> >> > speed, I cannot use AmiBroker to calculate the optimization
> >> >> > goal function. So, I am in the process of writing code to
> >> >> > match signals and calculate the portfolio fitness function.
> >> >> > Once I do this, I will be able to perform full optimizations
> >> >> > and walk forwards at 3 orders of magnitude faster than is
> >> >> > possible with AmiBroker alone.
> >> >> > 
> >> >> > Also, this is not general purpose code. Changing the system
> >> >> > code means changing a dll written in C. However, there is no
> >> >> > reason that this could not be made more general.
> >> >> > 
> >> >> > I have made some prototypes of "Cuda" versions of basic AFL
> >> >> > functions. The idea is to queue the function calls into a
> >> >> > definition executed by a micro kernel running on the graphics
> >> >> > cores. The result would be the ability to use the full power
> >> >> > of the graphics cores by modifying AFL code to use Cuda aware
> >> >> > versions with no changes to C code. It would be an
> >> >> > interesting, but big project.
> >> >> >
> >> >>
> >> > 
> >> > 
> >> > 
> >> > ------------------------------------
> >> > 
> >> > Please note that this group is for discussion between users 
only.
> >> > 
> >> > To get support from AmiBroker please send an e-mail directly 
to 
> >> > SUPPORT {at} amibroker.com
> >> > 
> >> > For NEW RELEASE ANNOUNCEMENTS and other news always check 
DEVLOG:
> >> > http://www.amibroker.com/devlog/
> >> > 
> >> > For other support material please check also:
> >> > http://www.amibroker.com/support.html
> >> > Yahoo! Groups Links
> >> > 
> >> > 
> >> >
> >>
> > 
> > 
> > 
> > 
> > 
> >
>


