Re: [amibroker] Re: Dual-core vs. quad-core



Markus,

I have a quad-core 2.8 GHz Intel Xeon with 12MB cache for each
processor (Mac Pro).  I chose this model specifically because of the
large cache so AB would run faster.  I am running from 1000 to 30,000
bars of intraday data with many different arrays.  I use a parameter
to set the max number of bars for my loops and use SetBarsRequired()
with that parameter to set the min number of bars for Fast AFL
operations.  I am only running one instance of AB currently, and I am
only running indicators.  Of course only one core is ever active for
AB.  I can, however, run other compute-intensive programs without
slowing down the AB core.

The only thing that really slows down AB for me is an operation that
requires more bars, or raising my number-of-bars parameter.  The thing
to watch out for, though, is that once you require more bars for an
operation and then switch to something that requires fewer bars, Fast
AFL will never reduce the number of bars it uses.  The only ways I have
found to reset it to the number of bars it really requires are to
restart AB, or to edit the formula (no changes needed) and apply it
again.

There are a number of operations that AB can do that require all the
bars (like ATC), so Fast AFL will not operate again after one use of
those operations.  This may have no bearing on the type of operations
you are doing (they may always require all bars), but I thought I
should mention it in case you do something like this and cannot
replicate your speed numbers (i.e., it runs fast, you do something
else, then it runs slow).

I don't use anything other than indicator mode, so perhaps Tomasz can
shed some more light on this situation in case there is something to
watch out for in your tests.

Best regards,
Dennis



On May 13, 2008, at 3:24 PM, Tomasz Janeczko wrote:

> Hello,
>
> I just ran the same code on my relatively new notebook (Core 2 Duo
> 2GHz (T7250))
> and the loop takes less than 2ns per iteration (3x speedup). So it
> looks like the data sits entirely inside the cache.
> This Core 2 has 2MB of cache, and that's 4 times more than on the
> Athlon x2 I have.
>
>> If what you say is true, and one core alone fills the memory
>> bandwidth, then there should be a net loss of performance while
>> running two copies of ami.
>
> It depends on the complexity of the formula and the amount of data
> per symbol you are using. As each array element has 4 bytes, to fill
> 4 MB of cache you would need 1 million array elements, or 100 arrays
> each having 10000 elements, or 10 arrays each having 100K elements.
> Generally speaking, people testing on EOD data, where 10 years is
> just 2600 bars, should see a speedup. People using very long
> intraday data sets may see degradation, but it should be barely
> noticeable.
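
The cache arithmetic above can be checked with a short Python sketch. It assumes, as stated, 4 bytes per array element. The function names are hypothetical, the 20-array EOD case is a made-up input, and the sketch ignores everything else that competes for the cache (code, stack, other processes), so treat it as a rough bound rather than a prediction.

```python
BYTES_PER_ELEMENT = 4  # one array element, as stated in the message above


def working_set_bytes(num_arrays: int, bars: int) -> int:
    """Total bytes occupied by `num_arrays` arrays of `bars` elements each."""
    return num_arrays * bars * BYTES_PER_ELEMENT


def fits_in_cache(num_arrays: int, bars: int, cache_bytes: int) -> bool:
    """True if the arrays alone would fit in a cache of `cache_bytes` bytes."""
    return working_set_bytes(num_arrays, bars) <= cache_bytes


if __name__ == "__main__":
    cache = 4 * 1024 * 1024  # the 4 MB cache used in the example
    print(working_set_bytes(100, 10_000))   # 100 arrays of 10,000 elements
    print(working_set_bytes(10, 100_000))   # 10 arrays of 100K elements
    print(fits_in_cache(20, 2_600, cache))  # hypothetical: 20 arrays, 10 years of EOD bars
```

Both array layouts come to about 4 million bytes, which is where the "1 million elements" figure comes from.
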
>
> Best regards,
> Tomasz Janeczko
> amibroker.com
> ----- Original Message -----
> From: "dloyer123" <dloyer123@xxxxxxxxx>
> To: <amibroker@xxxxxxxxxxxxxxx>
> Sent: Tuesday, May 13, 2008 8:12 PM
> Subject: [amibroker] Re: Dual-core vs. quad-core
>
>
>> Nice, tight loop.  It is good to see someone that has made the effort
>> to make the most out of every cycle and the result shows.
>>
>> My new E8400 (45nm 3GHz, dual core) system should arrive tomorrow.
>> The first thing I will do will be to benchmark it running ami.  I run
>> portfolio backtests over a few years of 5 minute data over a thousand
>> or so symbols.  Plenty of data to overflow the cache, but still fit
>> in memory.  No trig.
>>
>> I'll post what I find.
>>
>> If what you say is true, and one core alone fills the memory
>> bandwidth, then there should be a net loss of performance while
>> running two copies of ami.
>>
>>
>>
>> --- In amibroker@xxxxxxxxxxxxxxx, "Tomasz Janeczko" <groups@xxx>
>> wrote:
>>>
>>> Hello,
>>>
>>> FYI: A SINGLE processor core running an AFL formula is able to
>>> saturate memory bandwidth in the majority of the most common
>>> operations/functions if the total size of the arrays used in a given
>>> formula exceeds the DATA cache size.
>>>
>>> You need to understand that AFL runs at native assembly speed
>>> when using array operations.
>>> A simple array multiplication like this
>>>
>>> X = Close  * H; // array multiplication
>>>
>>> gets compiled to just 8 assembly instructions:
>>>
>>> loop:    8B 54 24 58          mov         edx,dword ptr [esp+58h]
>>> 00465068 46                   inc         esi                       ; increase counters
>>> 00465069 83 C0 04             add         eax,4
>>> 0046506C 3B F7                cmp         esi,edi
>>> 0046506E D9 44 B2 FC          fld         dword ptr [edx+esi*4-4]   ; get element of close array
>>> 00465072 D8 4C 08 FC          fmul        dword ptr [eax+ecx-4]     ; multiply by element of high array
>>> 00465076 D9 58 FC             fstp        dword ptr [eax-4]         ; store result
>>> 00465079 7C E9                jl          loop                      ; continue until all elements are processed
>>>
>>> As you can see, there are three 4-byte memory accesses per loop
>>> iteration (2 reads, each 4 bytes long, and 1 write, 4 bytes long).
>>>
>>> On my (2-year-old) 2GHz Athlon x2 64, a single iteration of this loop
>>> takes 6 nanoseconds (see benchmark code below).
>>> So, during 6 nanoseconds we have 8 bytes read and 4 bytes stored.
>>> That's (8/(6e-9)) bytes per second = 1333 MB per second read
>>> and 667 MB per second write simultaneously, i.e. 2GB/sec combined!
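
As a rough check of the figures above, here is the same arithmetic in Python. The 6 ns per iteration and the 8 bytes read / 4 bytes written per iteration are taken from the message. The function name is hypothetical, and 1 MB is taken as 10^6 bytes.

```python
def bandwidth_mb_per_s(bytes_per_iteration: int, seconds_per_iteration: float) -> float:
    """Sustained throughput in MB/s (1 MB = 1e6 bytes) for one loop iteration."""
    return bytes_per_iteration / seconds_per_iteration / 1e6


if __name__ == "__main__":
    iteration_time = 6e-9  # 6 ns per iteration, as reported in the message
    read = bandwidth_mb_per_s(8, iteration_time)   # two 4-byte loads
    write = bandwidth_mb_per_s(4, iteration_time)  # one 4-byte store
    print(f"read {read:.0f} MB/s, write {write:.0f} MB/s, total {read + write:.0f} MB/s")
```

Rounded, this gives 1333 MB/s read and 667 MB/s write, about 2 GB/s combined, which matches the numbers quoted.
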
>>>
>>> Now if you look at memory benchmarks:
>>> http://community.compuserve.com/n/docs/docDownload.aspx?webtag=ws-pchardware&guid=6827f836-8c33-4063-aaf5-c93605dd1dc6
>>> you will see that 2GB/s is THE LIMIT of system memory speed on
>>> Athlon x64 (DDR2 dual channel).
>>> And that's considering the fact that the Athlon has a superior-to-Intel
>>> on-die integrated memory controller (HyperTransport).
>>>
>>> // benchmark code - for accurate results run it on LARGE arrays
>>> // (intraday database, 1-minute interval, 50K bars or more)
>>> GetPerformanceCounter(1);
>>> for(k = 0; k < 1000; k++ ) X = C * H;
>>> "Time per single iteration [s]="+1e-3*GetPerformanceCounter()/(1000*BarCount);
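
The last line of the benchmark converts the counter reading into seconds per array element. Below is a Python sketch of that conversion, assuming GetPerformanceCounter() reports elapsed milliseconds (which the 1e-3 factor implies). The function name is hypothetical, and the 300 ms reading and 50,000-bar count are made-up inputs for illustration, not measurements.

```python
def seconds_per_element(elapsed_ms: float, passes: int, bar_count: int) -> float:
    """Mirror of 1e-3 * elapsed_ms / (passes * BarCount) from the AFL benchmark."""
    return 1e-3 * elapsed_ms / (passes * bar_count)


if __name__ == "__main__":
    # Hypothetical reading: 300 ms for 1000 passes over 50,000 bars.
    t = seconds_per_element(300.0, 1000, 50_000)
    print(f"{t * 1e9:.1f} ns per element")
```

With those inputs the result is 6 ns per element, the same order as the Athlon figure quoted above.
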
>>>
>>> Only really complex operations that use *lots* of FPU (floating
>>> point) cycles, such as trigonometric (sin/cos/tan) functions, are
>>> slow enough for the memory to keep up.
>>>
>>> Of course one may say that I am using an "old" processor, and new
>>> computers have faster RAM, and that's true,
>>> but processor speeds increase FASTER than bus speeds, and the gap
>>> between processor and RAM becomes larger and larger, so with newer
>>> CPUs the situation will be worse, not better.
>>>
>>>
>>> Best regards,
>>> Tomasz Janeczko
>>> amibroker.com
>>> ----- Original Message -----
>>> From: "dloyer123" <dloyer123@xxx>
>>> To: <amibroker@xxxxxxxxxxxxxxx>
>>> Sent: Tuesday, May 13, 2008 5:02 PM
>>> Subject: [amibroker] Re: Dual-core vs. quad-core
>>>
>>>
>>>> All of the cores have to share the same front bus and northbridge.
>>>> The northbridge connects the CPU to memory and has limited bandwidth.
>>>>
>>>> If several cores are running memory-hungry applications, the front
>>>> bus will saturate.
>>>>
>>>> The L2 cache helps for most applications, but not if you are burning
>>>> through a few G of quote data.  The L2 cache is just 4-8MB.
>>>>
>>>> The newer multi-core systems have much faster front buses and that
>>>> trend is likely to continue.
>>>>
>>>> So, it would be nice if AMI could support running on multiple cores,
>>>> even if it was just running different optimization passes on different
>>>> cores.  That would saturate the front bus, but take advantage of all
>>>> of the memory bandwidth you have.  It would really help those
>>>> multi-day walkforward runs.
>>>>
>>>>
>>>>
>>>> --- In amibroker@xxxxxxxxxxxxxxx, "markhoff" <markhoff@> wrote:
>>>>>
>>>>>
>>>>> If you have a runtime penalty when running 2 independent AB jobs on a
>>>>> Core Duo CPU, it might be caused by too little memory (swapping to disk)
>>>>> or by other tasks which are also running (e.g. a web browser, audio
>>>>> streamer or whatever). You can check this with a process explorer,
>>>>> which shows each task's CPU utilisation. Similarly, 4 AB jobs on a Core
>>>>> Quad should have nearly no penalty in runtime.
>>>>>
>>>>> Tomasz stated that multi-thread optimization does not scale well with
>>>>> the number of CPUs, but it is not clear to me why this is the case. In my
>>>>> understanding, AA optimization is a sequential process of running the
>>>>> same AFL script with different parameters. If I have an AFL with a
>>>>> significantly long runtime per optimization step (e.g. 1 minute), the
>>>>> overhead for the multi-threading should become quite small and
>>>>> independent tasks should scale nearly with the number of CPUs (as long
>>>>> as there is sufficient memory; n threads might need n times more
>>>>> memory than a single thread). For sure the situation is different if
>>>>> my single optimization run takes only a few millisecs or seconds;
>>>>> then the overhead for multi-thread management goes up ...
>>>>>
>>>>> Maybe Tomasz can give some detailed comments on that issue?
>>>>>
>>>>> Best regards,
>>>>> Markus
>>>>>
>>>>
>>>>
>>>> ------------------------------------
>>>>
>>>> Please note that this group is for discussion between users only.
>>>>
>>>> To get support from AmiBroker please send an e-mail directly to
>>>> SUPPORT {at} amibroker.com
>>>>
>>>> For NEW RELEASE ANNOUNCEMENTS and other news always check DEVLOG:
>>>> http://www.amibroker.com/devlog/
>>>>
>>>> For other support material please check also:
>>>> http://www.amibroker.com/support.html
>>>> Yahoo! Groups Links
>>>>
>>>>
>>>>
>>>
>>
>>
>>
>

