
[amibroker] Re: Cuda prototype using 64 cores on graphics card




Not a chance.  C code can be run on the cores, but they do not 
support the Win32 API.  They are optimized for repeating the same 
calculation over large chunks of data.  

Some products, such as Photoshop and MATLAB, have plug-ins that allow 
CUDA to be used as a co-processor.  It would be possible to do the 
same with AmiBroker, but I doubt that the market is large enough to 
support the effort.  I would love to be proven wrong.  

A good description can be found in the programming guide here:
http://www.nvidia.com/object/cuda_develop.html

The general idea is that GPUs have a transistor count similar to that 
of a CPU, but spend the transistor budget on execution units rather 
than cache, then use a fat pipe to memory.  

The current high end cards have 240 cores and new models come out 
every 6 months.  The card I am using is very modest. 

I would not be surprised if it is possible to perform a backtest over 
a few years of 5-minute bars at video rates (say, more than 20 per 
second).
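
As a rough sanity check on the "video rates" estimate, the sketch 
below compares the time budget per pass at 20 passes per second with 
the kernel timing quoted further down (60464 bars in 21 us).  The 
session length, trading days per year, and the choice of 3 years are 
assumptions made for illustration, not figures from this thread.  The 
measured kernel is only a simple average, not a backtest, and the 
timing excludes host-to-card copies, so the result is an upper bound 
on headroom rather than a prediction.

```python
# Assumptions (NOT from the thread): 6.5-hour session, 252 trading
# days per year, and "a few years" taken as 3.
BARS_PER_DAY = int(6.5 * 60 / 5)   # 78 five-minute bars per session
TRADING_DAYS = 252
YEARS = 3

# Figures taken from the thread.
TARGET_PASSES_PER_SEC = 20         # "video rates"
MEASURED_BARS = 60464              # bars processed in the simple test
MEASURED_SECONDS = 21e-6           # kernel time for those bars

bars_per_symbol = BARS_PER_DAY * TRADING_DAYS * YEARS
budget_per_pass = 1.0 / TARGET_PASSES_PER_SEC          # seconds

# Bars per second achieved by the simple (H+L+C)/3 kernel.
measured_rate = MEASURED_BARS / MEASURED_SECONDS

# Time the simple kernel would need for one symbol's history, and how
# many times that fits into the per-pass budget.
simple_kernel_time = bars_per_symbol / measured_rate
headroom = budget_per_pass / simple_kernel_time

print(f"bars per symbol:    {bars_per_symbol}")
print(f"budget per pass:    {budget_per_pass * 1e3:.0f} ms")
print(f"simple kernel time: {simple_kernel_time * 1e6:.1f} us")
print(f"headroom factor:    {headroom:.0f}x")
```

Under these assumptions one symbol has about 59,000 bars, close to 
the size of the test data set, and the per-pass budget is a few 
thousand times the simple kernel time.  A real backtest does far more 
work per bar, so how much of that headroom survives is an open 
question.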

--- In amibroker@xxxxxxxxxxxxxxx, ftonetti@xxx wrote:
>
> This is very interesting ... AB DLLs are one thing ... Do you 
> think it's possible to run individual instances of AB itself with 
> CUDA?
> 
> ----- Original Message -----
> From: dloyer123 
> Date: Thursday, July 24, 2008 3:13 pm
> Subject: [amibroker] Cuda prototype using 64 cores on graphics card
> To: amibroker@xxxxxxxxxxxxxxx
> 
> > I was able to get an AmiBroker DLL to work with the Nvidia CUDA 
> > drivers.
> > 
> > These drivers allow C code to run on the graphics shaders of a 
> > modern video card. These are the same processors that allow high 
> > speed 3D graphics. Several math-intensive applications report a 
> > 50-100 fold performance improvement over running on the host CPU. 
> > 
> > The mid-range card that came with my system has 64 cores, each 
> > able to perform one floating point operation per clock.
> > 
> > As a simple test, I wrote an AmiBroker plug-in, called from AFL. 
> > 
> > It calculated the average price (H+L+C)/3 for 60464 bars in 21 us.
> > 
> > This works out to about 8.5 GFLOP/s (billion floating point 
> > operations per second, counting 2 adds and 1 multiply per bar) 
> > and 46 GB/s of memory traffic (3 floats read and 1 written per 
> > bar).
> > 
> > The 46 GB/s transfer rate is not far from the available memory 
> > bandwidth on the card, so memory is the bottleneck of this simple 
> > test. The test calculation is not very "dense", so I should be 
> > able to get a much higher calculation rate once I move more of my 
> > code to the graphics cores. Several of the CUDA demos report 
> > more than 150 GFLOP/s. I used one thread per bar.
> > 
> > High-end graphics cards are available now that would improve 
> > performance by another factor of 2 to 4. 
> > 
> > A few problems:
> > * The above numbers do not include the time needed to copy the 
> >   data from AmiBroker to the graphics card or to copy the results 
> >   back. This time is much greater than the calculation time in 
> >   this simple test.
> > * This is not a general AFL accelerator. 
> > 
> > My goal is to reduce my current 25 s backtest time to under 1 s 
> > per pass. To do this, I will need to move the data set for all 
> > symbols to the graphics card once and make many passes over the 
> > data with different optimization values. Each CUDA thread will 
> > work on one symbol, rather than one thread per bar as in my first 
> > test. 
> > 
> > There is not much point in writing a CUDA routine that is just 
> > called directly from AFL code. There is too much overhead. In my 
> > application, the AFL code is a very small part of the total time 
> > for each backtest. Even if I reduced that time to zero, it would 
> > not reduce the time per pass very much. Also, the time needed to 
> > copy the price data on each pass would greatly reduce the benefit. 
> > As far as I can tell, the current AmiBroker API does not allow 
> > injecting an externally generated trade list into the backtest, so 
> > I will need to perform the full backtest and fitness function 
> > calculation externally. 
> > 
> > I had no compatibility problems getting the CUDA API to run as an 
> > AmiBroker plug-in. 
> > 
> > Why go to the trouble? Using Fred's IO program would get much of 
> > the same benefit for less trouble, or I could wait until AmiBroker 
> > finally supports multiple cores or finds other clever ways to 
> > reduce the per-pass overhead. The real answer is that I just had 
> > to try it....
> > 
>
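
The throughput figures in the quoted message can be reproduced from 
the numbers given there.  The sketch below is only that arithmetic; 
it assumes 4-byte single-precision floats, which the post implies but 
does not state.  It also computes the ratio of arithmetic to memory 
traffic, which is one way to put a number on the remark that the test 
calculation is not very "dense".

```python
# Figures quoted in the original post.
BARS = 60464
KERNEL_SECONDS = 21e-6
FLOPS_PER_BAR = 3        # 2 adds + 1 multiply
FLOATS_PER_BAR = 4       # 3 reads (H, L, C) + 1 write
BYTES_PER_FLOAT = 4      # assumption: single precision


def throughput(bars, seconds):
    """Return (GFLOP/s, GB/s) for the (H+L+C)/3 kernel."""
    gflops = bars * FLOPS_PER_BAR / seconds / 1e9
    gbytes = bars * FLOATS_PER_BAR * BYTES_PER_FLOAT / seconds / 1e9
    return gflops, gbytes


gflops, gbytes = throughput(BARS, KERNEL_SECONDS)

# Arithmetic intensity: floating point operations per byte moved.
intensity = FLOPS_PER_BAR / (FLOATS_PER_BAR * BYTES_PER_FLOAT)

print(f"{gflops:.1f} GFLOP/s, {gbytes:.1f} GB/s, "
      f"{intensity:.4f} FLOP/byte")
```

This gives roughly 8.6 GFLOP/s and 46 GB/s, in line with the "about 
8.5" and 46 quoted above.  At under 0.2 floating point operations per 
byte, the kernel spends nearly all of its time waiting on memory, 
which is consistent with the post's conclusion that memory is the 
bottleneck.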


