Hello,

I just ran the same code on my relatively new notebook (Core 2 Duo 2GHz (T7250))
and the loop takes less than 2ns per iteration (3x speedup). So it looks like
the data sits entirely inside the cache. This Core 2 has 2MB of cache and
that's 4 times more than on the Athlon x2 I have.
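As a rough cross-check of that reasoning (arithmetic only, not a measurement), the sketch below converts time per iteration into implied data traffic, using the 12 bytes per iteration (two 4-byte reads, one 4-byte write) from the loop quoted further down. The function name is made up for illustration, and 2 ns is used as an upper bound for the notebook.

```python
# Sketch: implied memory traffic of the "X = Close * H" loop.
# Assumption: 8 bytes read + 4 bytes written per iteration, as in the
# quoted assembly listing. 1 MB is taken as 1e6 bytes.
BYTES_READ_PER_ITER = 8
BYTES_WRITTEN_PER_ITER = 4


def implied_bandwidth_mb_s(ns_per_iteration):
    """Return (read, write, combined) traffic in MB/s.

    One byte per nanosecond equals 1000 MB/s, hence the factor 1000.
    """
    read = BYTES_READ_PER_ITER * 1000 / ns_per_iteration
    write = BYTES_WRITTEN_PER_ITER * 1000 / ns_per_iteration
    return read, write, read + write


athlon = implied_bandwidth_mb_s(6)   # 6 ns per iteration (Athlon x2 figure)
core2 = implied_bandwidth_mb_s(2)    # 2 ns per iteration (upper bound here)

print("Athlon x2 : read %.0f, write %.0f, combined %.0f MB/s" % athlon)
print("Core 2 Duo: read %.0f, write %.0f, combined %.0f MB/s" % core2)
```

At 2 ns or less per iteration the loop moves 6000 MB/s or more, about three times the 2GB/s figure quoted below as the main-memory limit on the Athlon. Assuming the notebook's RAM is in a similar range, that is consistent with the data being served from cache.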
> If what you say is true, and one core alone fills the memory
> bandwidth, then there should be a net loss of performance while
> running two copies of ami.
It depends on the complexity of the formula and the amount of data per symbol
you are using. As each array element takes 4 bytes, to fill 4 MB of cache
you would need 1 million array elements, or 100 arrays each having 10000
elements, or 10 arrays each having 100K elements. Generally speaking, people
testing on EOD data, where 10 years is just 2600 bars, should see a speedup.
People using very long intraday data sets may see degradation, but a rather
unnoticeable one.
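
The sizing rule above can be sketched as a small estimate. This is only an illustration: the helper names, the count of 10 arrays per formula, and the 500K-bar intraday case are assumptions, and a real cache also holds code and other data, so the actual limit is lower.

```python
# Sketch: does a formula's working set fit in the data cache?
# Assumption: 4 bytes per array element, as stated above.
BYTES_PER_ELEMENT = 4
CACHE_4MB = 4 * 1024 * 1024


def working_set_bytes(num_arrays, bars):
    """Total bytes occupied by num_arrays arrays of `bars` elements each."""
    return num_arrays * bars * BYTES_PER_ELEMENT


def fits_in_cache(num_arrays, bars, cache_bytes=CACHE_4MB):
    """True if all arrays together are no larger than the cache."""
    return working_set_bytes(num_arrays, bars) <= cache_bytes


# 10 years of EOD data, hypothetical 10 arrays used by the formula
print(working_set_bytes(10, 2600), fits_in_cache(10, 2600))
# hypothetical long intraday set: 500K bars, 10 arrays
print(working_set_bytes(10, 500000), fits_in_cache(10, 500000))
```

With these assumed numbers the EOD case needs about 100 KB and fits easily, while the intraday case needs 20 MB and has to stream from main memory.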

Best regards,
Tomasz Janeczko
amibroker.com
----- Original Message -----
From: "dloyer123" <dloyer123@xxxxxxcom>
To: <amibroker@xxxxxxxxxps.com>
Sent: Tuesday, May 13, 2008 8:12 PM
Subject: [amibroker] Re: Dual-core vs. quad-core

> Nice, tight loop. It is good to see someone that has made the effort
> to make the most out of every cycle and the result shows.
>
> My new E8400 (45nm 3GHz, dual core) system should arrive tomorrow.
> The first thing I will do will be to benchmark it running ami. I run
> portfolio backtests over a few years of 5 minute data over a thousand
> or so symbols. Plenty of data to overflow the cache, but still fit
> in memory. No trig.
>
> I'll post what I find.
>
> If what you say is true, and one core alone fills the memory
> bandwidth, then there should be a net loss of performance while
> running two copies of ami.
>
>
>
> --- In amibroker@xxxxxxxxxps.com, "Tomasz Janeczko" <groups@xxx> wrote:
>>
>> Hello,
>>
>> FYI: A SINGLE processor core running an AFL formula is able to
>> saturate memory bandwidth in the majority of the most common
>> operations/functions if the total size of the arrays used in a given
>> formula exceeds the DATA cache size.
>>
>> You need to understand that AFL runs with native assembly speed
>> when using array operations.
>> A simple array multiplication like this
>>
>> X = Close * H; // array multiplication
>>
>> gets compiled to just 8 assembly instructions:
>>
>> loop: 8B 54 24 58     mov   edx,dword ptr [esp+58h]
>> 00465068 46           inc   esi                       ; increase counters
>> 00465069 83 C0 04     add   eax,4
>> 0046506C 3B F7        cmp   esi,edi
>> 0046506E D9 44 B2 FC  fld   dword ptr [edx+esi*4-4]   ; get element of close array
>> 00465072 D8 4C 08 FC  fmul  dword ptr [eax+ecx-4]     ; multiply by element of high array
>> 00465076 D9 58 FC     fstp  dword ptr [eax-4]         ; store result
>> 00465079 7C E9        jl    loop                      ; continue until all elements are processed
>>
>> As you can see there are three 4 byte memory accesses per loop
>> iteration (2 reads each 4 bytes long and 1 write 4 bytes long)
>>
>> On my (2 year old) 2GHz Athlon x2 64 a single iteration of this loop
>> takes 6 nanoseconds (see benchmark code below).
>> So, during 6 nanoseconds we have 8 bytes read and 4 bytes stored.
>> That's (8/(6e-9)) bytes per second = 1333 MB per second read
>> and 667 MB per second write simultaneously i.e. 2GB/sec combined !
>>
>> Now if you look at memory benchmarks:
>> http://community.compuserve.com/n/docs/docDownload.aspx?webtag=ws-pchardware&guid=6827f836-8c33-4063-aaf5-c93605dd1dc6
>> you will see that 2GB/s is THE LIMIT of system memory speed on
>> Athlon x64 (DDR2 dual channel)
>> And that's considering the fact that Athlon has superior-to-intel
>> on-die integrated memory controller (HyperTransport)
>>
>> // benchmark code - for accurate results run it on LARGE arrays
>> // (intraday database, 1-minute interval, 50K bars or more)
>> GetPerformanceCounter(1);
>> for(k = 0; k < 1000; k++ ) X = C * H;
>> "Time per single iteration [s]="+1e-3*GetPerformanceCounter()/(1000*BarCount);
>>
>> Only really complex operations that use *lots* of FPU (floating
>> point) cycles such as trigonometric (sin/cos/tan) functions are slow
>> enough for the memory to keep up.
>>
>> Of course one may say that I am using "old" processor, and new
>> computers have faster RAM and that's true
>> but processor speeds increase FASTER than bus speeds and the gap
>> between processor and RAM becomes larger and larger so with newer
>> CPUs the situation will be worse, not better.
>>
>>
>> Best regards,
>> Tomasz Janeczko
>> amibroker.com
>> ----- Original Message -----
>> From: "dloyer123" <dloyer123@x..>
>> To: <amibroker@xxxxxxxxxps.com>
>> Sent: Tuesday, May 13, 2008 5:02 PM
>> Subject: [amibroker] Re: Dual-core vs. quad-core
>>
>>
>> > All of the cores have to share the same front bus and northbridge.
>> > The northbridge connects the cpu to memory and has limited
>> > bandwidth.
>> >
>> > If several cores are running memory hungry applications, the front
>> > bus will saturate.
>> >
>> > The L2 cache helps for most applications, but not if you are burning
>> > through a few G of quote data. The L2 cache is just 4-8MB.
>> >
>> > The newer multi core systems have much faster front buses and that
>> > trend is likely to continue.
>> >
>> > So, it would be nice if AMI could support running multi cores, even
>> > if it was just running different optimization passes on different
>> > cores. That would saturate the front bus, but take advantage of all
>> > of the memory bandwidth you have. It would really help those multi
>> > day walkforward runs.
>> >
>> >
>> >
>> > --- In amibroker@xxxxxxxxxps.com, "markhoff" <markhoff@> wrote:
>> >>
>> >>
>> >> If you have a runtime penalty when running 2 independent AB jobs on a
>> >> Core Duo CPU it might be caused by too little memory (swapping to disk)
>> >> or other tasks which are also running (e.g. a web browser, audio
>> >> streamer or whatever). You can check this with a process explorer
>> >> which shows each task's CPU utilisation. Similarly, 4 AB jobs on a Core
>> >> Quad should have nearly no penalty in runtime.
>> >>
>> >> Tomasz stated that multi-thread optimization does not scale well with
>> >> the CPU number, but it is not clear to me why this is the case. In my
>> >> understanding, AA optimization is a sequential process of running the
>> >> same AFL script with different parameters. If I have an AFL with
>> >> significantly long runtime per optimization step (e.g. 1 minute) the
>> >> overhead for the multi-threading should become quite small and
>> >> independent tasks should scale nearly with the number of CPUs (as long
>> >> as there is sufficient memory, n threads might need n-times more
>> >> memory than a single thread). For sure the situation is different if
>> >> my single optimization run takes only a few millisecs or seconds, then
>> >> the overhead for multi-thread-management goes up ...
>> >>
>> >> Maybe Tomasz can give some detailed comments on that issue?
>> >>
>> >> Best regards,
>> >> Markus
>> >>
>> >
>> >
>> >
>> > ------------------------------------
>> >
>> > Please note that this group is for discussion between users only.
>> >
>> > To get support from AmiBroker please send an e-mail directly to
>> > SUPPORT {at} amibroker.com
>> >
>> > For NEW RELEASE ANNOUNCEMENTS and other news always check DEVLOG:
>> > http://www.amibroker.com/devlog/
>> >
>> > For other support material please check also:
>> > http://www.amibroker.com/support.html
>> >
>> > Yahoo! Groups Links
>> >
>> >
>> >
>>
>