[amibroker] Re: Freakishly fast backtest using 64 cores



Another very good question.

I found that rather than generate the full set of 8 arrays on the card to 
hold buy/sell/buyprice/sellprice, etc., it was simpler and faster to 
just create a signal list.  

The signal list is very small compared to the full arrays and feeds 
the fitness function.  This runs on the host system, the normal CPU.  
It matches the signals and does a portfolio simulation, along with 
the fitness function.  It does not have every bell and whistle that 
AmiBroker's does, but it has enough.  
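
A minimal sketch of that host-side step, in Python for brevity (this is 
not the actual Dll code). The `Signal` record, the long-only matching 
rule, the fixed share count and the net-profit fitness are all 
assumptions made for illustration:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signal:
    """One entry in the compact signal list (hypothetical layout)."""
    bar: int        # index of the bar on which the signal fired
    symbol: str
    is_buy: bool    # True = entry, False = exit
    price: float


def match_signals(signals):
    """Pair each buy with the next sell on the same symbol (long-only)."""
    open_entries = {}
    trades = []
    for sig in sorted(signals, key=lambda s: s.bar):
        if sig.is_buy:
            # Ignore a repeated buy while a position is already open.
            open_entries.setdefault(sig.symbol, sig)
        elif sig.symbol in open_entries:
            trades.append((open_entries.pop(sig.symbol), sig))
    return trades


def fitness(signals, shares=100):
    """Assumed fitness: net profit with a fixed share count per trade."""
    return sum((exit_sig.price - entry.price) * shares
               for entry, exit_sig in match_signals(signals))


signals = [
    Signal(3, "AAA", True, 10.0),
    Signal(5, "BBB", True, 20.0),
    Signal(9, "AAA", False, 12.0),
    Signal(12, "BBB", False, 19.0),
]
print(fitness(signals))
```

The point of the design is that only the few signals cross from the card 
to the host, not one value per bar per array.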

In optimization mode, the fitness function is used to find the best run 
so far, and then another run is made.  
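
As a rough sketch of that loop (assumptions: a brute-force grid search, 
a hypothetical `run_backtest` callable standing in for one pass on the 
card, and made-up parameter names "fast" and "slow"):

```python
from itertools import product


def brute_force_optimize(run_backtest, param_grid):
    """Try every parameter combination and keep the best run so far."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = run_backtest(params)    # one backtest pass
        if score > best_score:          # fitness decides which run wins
            best_params, best_score = params, score
    return best_params, best_score


def toy_backtest(params):
    # Stand-in for a real pass; its score peaks at fast=5, slow=20.
    return -((params["fast"] - 5) ** 2) - ((params["slow"] - 20) ** 2)


grid = {"fast": range(2, 9), "slow": range(10, 31, 5)}
best_params, best_score = brute_force_optimize(toy_backtest, grid)
print(best_params, best_score)
```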

In other modes, the Dll creates a set of buy/sell/buyprice/sellprice, 
etc. arrays, fills them according to the signal list, and sets the 
AmiBroker arrays "cudaBuy", "cudaBuyPrice", etc.  These can be 
assigned to "buy", "sell", etc. or used to compare results.
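
Expanding the signal list back into per-bar arrays could look roughly 
like this. It is a sketch only: "cudaBuy" and "cudaBuyPrice" are the 
names mentioned above, while "cudaSell", "cudaSellPrice" and the 
(bar, is_buy, price) tuple layout are assumed by analogy:

```python
def signals_to_arrays(signals, n_bars):
    """Expand (bar, is_buy, price) signals into full per-bar arrays."""
    arrays = {
        "cudaBuy": [0] * n_bars,
        "cudaSell": [0] * n_bars,           # assumed name
        "cudaBuyPrice": [0.0] * n_bars,
        "cudaSellPrice": [0.0] * n_bars,    # assumed name
    }
    for bar, is_buy, price in signals:
        side = "Buy" if is_buy else "Sell"
        arrays["cuda" + side][bar] = 1
        arrays["cuda" + side + "Price"][bar] = price
    return arrays


arrays = signals_to_arrays([(1, True, 10.0), (3, False, 12.0)], n_bars=5)
print(arrays["cudaBuy"], arrays["cudaSell"])
```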

So, it runs a full portfolio backtest.  And all use cases (backtest, 
indicator, optimize, walkforward) work the way you would expect.  


--- In amibroker@xxxxxxxxxxxxxxx, "Paul Ho" <paul.tsho@xxx> wrote:
>
> Thanks
> This has the potential to revolutionise the way we do backtests.
> I still have a couple of questions, if I may.
> In modes 2 and 3, does your dll function return the buy/sell/short/cover
> arrays together with buyprice/sellprice, etc. for every symbol that was
> backtested? I presume that if you did, writing the arrays back into main
> memory would slow things down a lot.
> I'm not sure from your reply whether you do your own portfolio backtest
> and optimization within the video card, or whether you just do the
> optimization on the 1st symbol only. If that is the case, what do you do
> for the rest of the symbols? Do you just use the optimized parameters
> from the 1st symbol?
> Am I correct to assume that to get the optimum performance, you really
> need to do all the optimization and portfolio testing inside the video
> card?
> Am I correct to say that you are reproducing the results of backtest and
> optimization within the video card, which also means that something like
> in-line portfolio backtesting would be done in an instant, and can be
> performed in indicator mode! (I'm not sure if Herman is reading this).
> Anyway, good work, and thanks for sharing.
> I have downloaded the CUDA files and as soon as I find time, I'm going to
> have a play.
> Cheers
> Paul.
>  
> 
> 
>   _____  
> 
> From: amibroker@xxxxxxxxxxxxxxx [mailto:amibroker@xxxxxxxxxxxxxxx]
> On Behalf Of dloyer123
> Sent: Wednesday, 6 August 2008 2:22 PM
> To: amibroker@xxxxxxxxxxxxxxx
> Subject: [amibroker] Re: Freakishly fast backtest using 64 cores
> 
> 
> 
> Very good question. That was a head scratcher. 
> 
> So the thing is, AmiBroker does a lot more work in an optimization 
> pass than execute AFL code. In fact, the AFL code may take very 
> little of the total run time.
> 
> As an example, using a database with a good amount of data, write an 
> AFL file that does nothing but set the buy/sell/short/cover arrays to 0. 
> The backtest will still take a good bit of time. 
> 
> So, even reducing the AFL run time to zero is not enough. It will 
> not help much at all.
> 
> So, to avoid this, I pass a "mode" variable to my Dll. This mode is 
> set by a simple optimization statement:
> 
> mode = optimize("mode",0,1,3,1);
> 
> When mode = 0, the dll will evaluate one symbol like a normal dll. 
> So if I click on a bar, it will update my printf statements, etc. 
> buy/sell/short/cover arrays are set. A single backtest (not 
> optimize) will use the normal AmiBroker trade match and evaluate 
> code and generate stats as normal.
> 
> When mode = 1, this means load the data. The Dll will copy the 
> price data to a staging area in memory. buy/sell/short/cover are set 
> to 0 to generate no trades. Having AmiBroker align the symbol bars 
> was a big help here.
> 
> When mode = 2, on the first symbol and the first symbol only, it 
> loads the price data to the video card and executes as many backtest 
> passes as it needs at a few ms per pass. Once the best combination 
> is found, it returns. buy/sell/short/cover are set to 0. Note that I 
> cannot use the AmiBroker signal match and fitness function code. I 
> have to provide my own. This is where the performance advantage of 
> all of the extra cores comes into play. It may run hundreds or 
> thousands of parameter combinations very quickly. I can't use the 
> built-in optimize support, but brute force is enough for now. 
> After all, I get 200 combinations per second.
> 
> When mode = 3, each symbol is evaluated using the best parameters 
> found on the last mode=2 run. buy/sell/short/cover are set. In a 
> walkforward test, this will always have the best score and be used 
> for the walkforward step. A custom backtest function adds the chosen 
> parameters to the backtest report. Mode 3 works like mode 0 except 
> it uses the optimal parameters rather than default values.
> 
> The Status("action") and Status("actionex") codes could also be 
> used, but they did not tell me quite what I needed to know. Also, I 
> could have avoided the mode=2 step if I could find a way to know I 
> was on the last symbol and run the optimization then. I guess I 
> could pass the name of the last symbol. 
> 
> So I use AmiBroker to load and keep the database, visualize the 
> trades, validate, walkforward and provide deep metrics of the 
> backtest. 
> 
> If I wanted to take this further, I would move the trade system 
> logic out of the Dll and make it programmable from AFL. That way it 
> could be used by anyone without needing to program C. I would do 
> this by passing handles to CUDA arrays through the AFL code. 
> 
> --- In amibroker@xxxxxxxxxxxxxxx, "Paul Ho" <paul.tsho@> wrote:
> >
> > Thanks for your insight.
> > I hope you don't mind sharing a little bit more detail.
> > You said:
> > > To get the best performance, my AFL code makes one pass over the 
> > > data, calling a Dll. The Dll takes all of the data needed by the 
> > > calculation and loads a copy to the video card. This upload is 
> > > slow; the entire upload takes about 45 seconds for all 1000 
> > > symbols. 
> > > 
> > > Once all of the data is uploaded, the Dll loads a "kernel" into 
> > > the graphics cores that performs the actual computation and 
> > > generates the trade list. 
> > 
> > Normally AB loads the data from the database as needed, calls a 
> > function in a dll, and passes data in arrays or whatever as 
> > arguments of the function. The function will be called for every 
> > ticker in the watchlist, and the data pertaining to that symbol is 
> > passed each time. I wonder how you do a "single pass" over the data, 
> > because AB passes the data as part of the argument regardless of how 
> > many optimizations it had previously run with the same data. I just 
> > wonder how you do it.
> > Cheers
> > Paul.
> > 
> > --- In amibroker@xxxxxxxxxxxxxxx, "dloyer123" <dloyer123@> wrote:
> > >
> > > This uses the mid-range video card that happened to come with my 
> > > system, a 9800GT. The newer 260 and 280 cards are 3 to 4 times 
> > > faster. The 260 can be found at Best Buy for $300. Some laptops 
> > > have compatible cards as well. 
> > > 
> > > The video card has its own memory; mine has 512MB, some have as 
> > > much as 1GB. This memory is very fast once it is loaded from the 
> > > main system. Nvidia has a professional line of products that have 
> > > much more memory. 
> > > 
> > > To get the best performance, my AFL code makes one pass over the 
> > > data, calling a Dll. The Dll takes all of the data needed by the 
> > > calculation and loads a copy to the video card. This upload is 
> > > slow; the entire upload takes about 45 seconds for all 1000 
> > > symbols. 
> > > 
> > > Once all of the data is uploaded, the Dll loads a "kernel" into 
> > > the graphics cores that performs the actual computation and 
> > > generates the trade list. This part is very fast and performs all 
> > > of the same functions that my AFL version does. The resulting 
> > > trade list is the same. 
> > > 
> > > Because the data is loaded into video memory, it can be reused 
> > > for many passes over the data with different optimization values. 
> > > So, hundreds of combinations of optimization values can be tried 
> > > per second. 
> > > 
> > > For non-optimization runs, the Dll just loads one symbol into 
> > > video memory and processes it. Counting the overhead of moving 
> > > data to the video card and extracting the trade list for a single 
> > > symbol, the speed is similar to AFL code alone. This lets me test 
> > > the code and make sure it is correct.
> > > 
> > > This approach works best when the data only needs to be loaded 
> > > once, then reused many times. It also works best when there is a 
> > > lot of data to work with. 
> > > 
> > > What is more interesting to me, and what would be more useful for 
> > > others, would be a general driver that requires no Dll changes to 
> > > modify the system. The performance would not be as good as hand 
> > > optimized code, but would still be much better than AFL code 
> > > alone. It would take trading system design to a whole new level. 
> > > It would provide enough performance to make working with intraday 
> > > data as easy as daily data is today.
> > > 
> > > Writing such a driver would be hard, but I have already done some 
> > > prototypes and design work. I am tempted to do it for my own use. 
> > > If I made it available to others, supporting it would be a PITA. 
> > > 
> > > 
> > > 
> > > --- In amibroker@xxxxxxxxxxxxxxx, "Paul Ho" <paul.tsho@> wrote:
> > > >
> > > > I'm very interested.
> > > > Could you elaborate a bit more?
> > > > What model of Nvidia chipset are you using, and with how much 
> > > > memory?
> > > > Not sure exactly what you mean when you say:
> > > > "It uses AmiBroker to load the symbol data and perform 
> > > > calculations that do not depend on the optimization parameters. 
> > > > Once loaded into video memory, repeated passes can be made with 
> > > > different parameters, avoiding any overhead."
> > > > Can you give me some examples? I presume that when your dll is 
> > > > called, AB passes one or more arrays of data belonging to 1 
> > > > symbol; is that true?
> > > > Not sure exactly what the rest means either. How many functions 
> > > > are you running in your dll, and what does each of them do?
> > > > Great of you to share your insight.
> > > > Cheers
> > > > Paul.
> > > > 
> > > > 
> > > > 
> > > > _____ 
> > > > 
> > > > From: amibroker@xxxxxxxxxxxxxxx [mailto:amibroker@xxxxxxxxxxxxxxx]
> > > > On Behalf Of dloyer123
> > > > Sent: Tuesday, 5 August 2008 9:19 AM
> > > > To: amibroker@xxxxxxxxxxxxxxx
> > > > Subject: [amibroker] Freakishly fast backtest using 64 cores
> > > > 
> > > > 
> > > > 
> > > > Greetings,
> > > > 
> > > > I ported part of my AFL backtest code to a plugin that takes 
> > > > advantage of the graphics math cores on the video card that are 
> > > > normally used for 3D graphics. 
> > > > 
> > > > I was able to get a several-thousand-fold performance 
> > > > improvement over AFL code alone.
> > > > 
> > > > My goal was to reduce the 25 seconds AFL code alone uses for a 
> > > > single portfolio-level backtest to less than 1 second, allowing 
> > > > multi-day optimization and walkforward runs to complete in a 
> > > > more reasonable time, and also just to see how fast I could get 
> > > > it to run.
> > > > 
> > > > The backtest runs over 1 year of 5 minute bars for about 1000 
> > > > symbols. 1 year of data normally takes 25 seconds for AmiBroker 
> > > > alone, or 18 seconds for 6 months of data. A typical 
> > > > optimization run takes hundreds of these passes per walkforward 
> > > > step, taking hours.
> > > > 
> > > > Using the Nvidia CUDA API, running on my mid-range video card, 
> > > > it was much faster. Much, much, much faster. How fast?
> > > > 
> > > > It reduced the run time from 25s to... 4.4ms. That is more than 
> > > > 200/s! 
> > > > 
> > > > I didn't believe the timing when I saw it at first. So, I put 
> > > > 1,000 runs in a loop and sure enough, it ran 1,000 iterations 
> > > > in about 4 1/2 seconds. This far exceeded my goal and 
> > > > expectations.
> > > > 
> > > > The resulting trade list matches that obtained by the AFL 
> > > > version of this code. 
> > > > 
> > > > I estimate that it is processing 32GB of bar data/sec.
> > > > 
> > > > Getting this to work at peak performance was tricky. Most of 
> > > > what I have learned about code optimization does not apply. 
> > > > 
> > > > It uses AmiBroker to load the symbol data and perform 
> > > > calculations that do not depend on the optimization parameters. 
> > > > Once loaded into video memory, repeated passes can be made with 
> > > > different parameters, avoiding any overhead. 
> > > > 
> > > > For non backtest/optimization runs, the code just evaluates one 
> > > > symbol and passes the data back to the AmiBroker 
> > > > buy/sell/short/cover arrays, making it easy to test, validate 
> > > > and visualize the trades. There is very little performance gain 
> > > > in this case. 
> > > > 
> > > > There are problems, however. To run optimizations at peak 
> > > > speed, I cannot use AmiBroker to calculate the optimization 
> > > > goal function. So, I am in the process of writing code to match 
> > > > signals and calculate the portfolio fitness function. Once I do 
> > > > this, I will be able to perform full optimizations and 
> > > > walkforwards 3 orders of magnitude faster than is possible with 
> > > > AmiBroker alone.
> > > > 
> > > > Also, this is not general purpose code. Changing the system 
> > > > code means changing a dll written in C. However, there is no 
> > > > reason that this could not be made more general. 
> > > > 
> > > > I have made some prototypes of "Cuda" versions of basic AFL 
> > > > functions. The idea is to queue the function calls into a 
> > > > definition executed by a micro kernel running on the graphics 
> > > > cores. The result would be the ability to use the full power of 
> > > > the graphics cores by modifying AFL code to use Cuda-aware 
> > > > versions with no changes to C code. It would be an interesting, 
> > > > but big project.
> > > >
> > >
> >
>


