This is very interesting ... AB DLLs are one thing ... do you think it's possible to run individual instances of AB itself with CUDA?

----- Original Message -----
From: dloyer123
Date: Thursday, July 24, 2008 3:13 pm
Subject: [amibroker] Cuda prototype using 64 cores on graphics card
To: amibroker@xxxxxxxxxxxxxxx
> I was able to get an AmiBroker DLL to work with the Nvidia CUDA drivers.
>
> These drivers allow C code to run on the graphics shaders of a modern video card. These are the same processors that allow high speed 3D graphics. Several math-intensive applications report a 50-100 fold performance improvement over running on the host CPU.
>
> The mid-range card that came with my system has 64 cores, each able to perform one floating point operation per clock.
>
> As a simple test, I wrote an AmiBroker plug-in, called from AFL.
>
> It calculated the average price (H+L+C)/3 for 60464 bars in 21 us.
>
> This works out to about 8.5 GFLOPS (billion floating point operations per second) and a 46 GB/s memory transfer rate (3 floats read and 1 written per bar; 2 floating point adds and 1 multiply per bar).
>
> The 46 GB/s transfer rate is not far from the available memory bandwidth on the card, but the simple test calculation is not very "dense", so I should be able to get a much higher calculation rate once I move more of my code to the graphics cores. Several of the CUDA demos report > 150 GF/s. Memory is the bottleneck of this simple test. I used one thread per bar.
>
> High-end graphics cards are available now that would improve performance by another factor of 2 to 4.
>
> A few problems:
> * The above numbers do not include the time needed to copy the data from AmiBroker to the graphics card or to copy the results back. This time is much greater than the calculation time in this simple test.
> * This is not a general AFL accelerator.
>
> My goal is to reduce my current 25 s backtest time down to < 1 s per pass. To do this, I will need to move the data set for all symbols to the graphics card once and make many passes over the data with different optimization values. Each CUDA thread will work on one symbol, rather than a thread per bar as in my first test.
>
> There is not much point in writing a CUDA routine that just executes code directly from AFL; there is too much overhead. In my application, the AFL code is a very small part of the total time for each backtest, so even if I reduced that time to zero, it would not shorten a pass very much. Also, the time needed to copy the price data on each pass would greatly reduce the benefit. As far as I can tell, the current AmiBroker API does not allow injecting an externally generated trade list into the backtester, so I will need to perform the full backtest and fitness function calculation externally.
>
> I had no compatibility problems getting the CUDA API to run inside an AmiBroker plug-in.
>
> Why go to the trouble? Using Fred's IO program would get much of the same benefit for less trouble, or I could wait until AmiBroker finally supports multiple cores, or finds other clever ways to reduce the per-pass overhead. The real answer is that I just had to try it....
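
For anyone curious what that simple test roughly looks like, here is a minimal sketch of a one-thread-per-bar (H+L+C)/3 kernel together with the host-side copies that dominate its runtime. The function and variable names and the 256-thread block size are my own assumptions for illustration; this is not dloyer123's actual plug-in code, and the AmiBroker plug-in (ADK) glue is omitted.

// avg_price.cu -- minimal sketch of the per-bar test described above.
// All names and the block size are illustrative assumptions.
#include <cuda_runtime.h>

// One thread per bar: out[i] = (H + L + C) / 3
__global__ void avgPriceKernel(const float* high, const float* low,
                               const float* close, float* out, int nBars)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nBars)
        out[i] = (high[i] + low[i] + close[i]) * (1.0f / 3.0f);
}

// Host wrapper: copy the three price arrays to the card, run the kernel,
// copy the result back. As the post notes, these copies cost far more
// than the kernel itself in this simple test.
void avgPriceGPU(const float* h, const float* l, const float* c,
                 float* result, int nBars)
{
    size_t bytes = (size_t)nBars * sizeof(float);
    float *dH, *dL, *dC, *dOut;
    cudaMalloc(&dH, bytes);  cudaMalloc(&dL, bytes);
    cudaMalloc(&dC, bytes);  cudaMalloc(&dOut, bytes);

    cudaMemcpy(dH, h, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dL, l, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dC, c, bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks  = (nBars + threads - 1) / threads;
    avgPriceKernel<<<blocks, threads>>>(dH, dL, dC, dOut, nBars);

    cudaMemcpy(result, dOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dH);  cudaFree(dL);  cudaFree(dC);  cudaFree(dOut);
}

Multiplying by 1.0f/3.0f instead of dividing keeps the work at the two adds and one multiply per bar counted in the post's arithmetic.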
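
The planned restructuring (price data for all symbols copied to the card once, one CUDA thread per symbol, many optimization passes) could be organized roughly as below. Everything here is a hypothetical illustration: the Params struct, fitnessKernel, the toy entry rule, and the packed numSymbols x maxBars layout are my assumptions, not code or API from the post or from AmiBroker.

// optimize_sketch.cu -- rough sketch of "copy the data once, run many passes".
#include <cuda_runtime.h>
#include <vector>

struct Params { float threshold; };    // one optimization candidate (assumed)

// One thread per symbol. Prices are packed as a numSymbols x maxBars matrix.
// The trading rule is a placeholder just to give the kernel a body.
__global__ void fitnessKernel(const float* close, const int* barCount,
                              int numSymbols, int maxBars,
                              Params p, float* fitness)
{
    int sym = blockIdx.x * blockDim.x + threadIdx.x;
    if (sym >= numSymbols) return;

    const float* c = close + (size_t)sym * maxBars;
    float equity = 0.0f;
    for (int i = 1; i < barCount[sym]; ++i)
        if (c[i] > c[i - 1] * (1.0f + p.threshold))   // toy entry rule
            equity += c[i] - c[i - 1];
    fitness[sym] = equity;
}

// Prices and bar counts go to the card once; each pass launches the kernel
// with a new parameter set and copies back only the small fitness array.
void optimize(const float* hostClose, const int* hostBarCount,
              int numSymbols, int maxBars,
              const std::vector<Params>& candidates, float* fitnessPerPass)
{
    size_t priceBytes = (size_t)numSymbols * maxBars * sizeof(float);
    float* dClose;  int* dCount;  float* dFit;
    cudaMalloc(&dClose, priceBytes);
    cudaMalloc(&dCount, numSymbols * sizeof(int));
    cudaMalloc(&dFit,   numSymbols * sizeof(float));
    cudaMemcpy(dClose, hostClose, priceBytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dCount, hostBarCount, numSymbols * sizeof(int),
               cudaMemcpyHostToDevice);

    int threads = 128, blocks = (numSymbols + threads - 1) / threads;
    for (size_t pass = 0; pass < candidates.size(); ++pass) {
        fitnessKernel<<<blocks, threads>>>(dClose, dCount, numSymbols, maxBars,
                                           candidates[pass], dFit);
        cudaMemcpy(fitnessPerPass + pass * numSymbols, dFit,
                   numSymbols * sizeof(float), cudaMemcpyDeviceToHost);
    }
    cudaFree(dClose); cudaFree(dCount); cudaFree(dFit);
}

The point of this layout is that only a numSymbols-sized fitness array crosses the bus on each pass, so the per-pass cost is essentially just the kernel time, which is what the post is aiming for with the sub-second target.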