Hello,

I just ran the same code on my relatively new notebook
(Core 2 Duo 2 GHz, T7250) and the loop takes less than 2 ns per
iteration (3x speedup). So it looks like the data sits entirely
inside the cache. This Core 2 has 2 MB of cache, and that's 4 times
more than on the Athlon x2 I've got.
> If what you say is true, and one core alone fills the memory
> bandwidth, then there should be a net loss of performance while
> running two copies of ami.
It depends on the complexity of the formula and the amount of data
per symbol you are using. As each array element takes 4 bytes, to
fill 4 MB of cache you would need 1 million array elements, or 100
arrays each having 10,000 elements, or 10 arrays each having 100K
elements. Generally speaking, people testing on EOD data, where 10
years is just 2600 bars, should see a speed-up. People using very,
very long intraday data sets may see degradation, but a rather
unnoticeable one.
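As a back-of-the-envelope check of the sizing argument above, here is a
small sketch (using the round decimal figures from the text, i.e. 4 MB
taken as 4,000,000 bytes):

```python
# Rough cache-capacity arithmetic for 4-byte array elements,
# mirroring the sizing argument in the text.
CACHE_BYTES = 4_000_000   # 4 MB data cache (round decimal figure)
ELEM_BYTES = 4            # each AFL array element is 4 bytes

max_elements = CACHE_BYTES // ELEM_BYTES
print(max_elements)              # 1000000 -- 1 million elements fill the cache

# Equivalent ways to reach that total:
print(max_elements // 10_000)    # 100 arrays of 10,000 bars each
print(max_elements // 100_000)   # 10 arrays of 100K bars each

# EOD perspective: roughly 260 trading days per year, 10 years
print(10 * 260)                  # 2600 bars -- tiny compared to the cache
```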
Best regards,
Tomasz Janeczko
amibroker.com
----- Original Message -----
From: "dloyer123" <dloyer123@xxxxxxcom>
To: <amibroker@xxxxxxxxxps.com>
Sent: Tuesday, May 13, 2008 8:12 PM
Subject: [amibroker] Re: Dual-core vs. quad-core
> Nice, tight loop. It is good to see someone that has made the
> effort to make the most out of every cycle, and the result shows.
>
> My new E8400 (45nm 3GHz, dual core) system should arrive tomorrow.
> The first thing I will do will be to benchmark it running ami. I run
> portfolio backtests over a few years of 5-minute data over a
> thousand or so symbols. Plenty of data to overflow the cache, but
> still fit in memory. No trig.
>
> I'll post what I find.
>
> If what you say is true, and one core alone fills the memory
> bandwidth, then there should be a net loss of performance while
> running two copies of ami.
>
>
>
> --- In amibroker@xxxxxxxxxps.com, "Tomasz Janeczko" <groups@xxx>
> wrote:
>>
>> Hello,
>>
>> FYI: a SINGLE processor core running an AFL formula is able to
>> saturate memory bandwidth in the majority of common
>> operations/functions if the total array sizes used in a given
>> formula exceed the DATA cache size.
>>
>> You need to understand that AFL runs with native assembly speed
>> when using array operations.
>> A simple array multiplication like this
>>
>> X = Close * H; // array multiplication
>>
>> gets compiled to just 8 assembly instructions:
>>
>> loop:    8B 54 24 58    mov  edx,dword ptr [esp+58h]
>> 00465068 46             inc  esi                      ; increase counters
>> 00465069 83 C0 04       add  eax,4
>> 0046506C 3B F7          cmp  esi,edi
>> 0046506E D9 44 B2 FC    fld  dword ptr [edx+esi*4-4]  ; get element of close array
>> 00465072 D8 4C 08 FC    fmul dword ptr [eax+ecx-4]    ; multiply by element of high array
>> 00465076 D9 58 FC       fstp dword ptr [eax-4]        ; store result
>> 00465079 7C E9          jl   loop                     ; continue until all elements are processed
>>
>> As you can see, there are three 4-byte memory accesses per loop
>> iteration (two 4-byte reads and one 4-byte write).
>>
>> On my (2-year-old) 2 GHz Athlon x2 64 a single iteration of this
>> loop takes 6 nanoseconds (see benchmark code below).
>> So, during those 6 nanoseconds we have 8 bytes read and 4 bytes
>> stored. That's (8/(6e-9)) bytes per second = 1333 MB per second
>> read and 667 MB per second written simultaneously, i.e. 2 GB/sec
>> combined!
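For readers who want to verify the quoted arithmetic, a minimal sketch
(the 6 ns per iteration is the measured figure from the text; MB here
means 10^6 bytes):

```python
# Recompute the bandwidth figures: per 6 ns loop iteration there are
# two 4-byte reads and one 4-byte write.
ITER_TIME = 6e-9      # seconds per loop iteration (measured above)
READ_BYTES = 8        # 2 reads x 4 bytes
WRITE_BYTES = 4       # 1 write x 4 bytes

read_mbs = READ_BYTES / ITER_TIME / 1e6     # MB per second read
write_mbs = WRITE_BYTES / ITER_TIME / 1e6   # MB per second written
combined_gbs = (READ_BYTES + WRITE_BYTES) / ITER_TIME / 1e9

print(round(read_mbs))    # 1333 MB/s
print(round(write_mbs))   # 667 MB/s
print(combined_gbs)       # ~2 GB/s combined
```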
>>
>> Now if you look at memory benchmarks:
>> http://community.compuserve.com/n/docs/docDownload.aspx?webtag=ws-pchardware&guid=6827f836-8c33-4063-aaf5-c93605dd1dc6
>> you will see that 2 GB/s is THE LIMIT of system memory speed on the
>> Athlon x64 (DDR2 dual channel).
>> And that's considering the fact that the Athlon has a
>> superior-to-Intel on-die integrated memory controller
>> (HyperTransport).
>>
>> // benchmark code - for accurate results run it on LARGE arrays
>> // (intraday database, 1-minute interval, 50K bars or more)
>>
>> GetPerformanceCounter(1);
>> for( k = 0; k < 1000; k++ ) X = C * H;
>> "Time per single iteration [s]=" + 1e-3*GetPerformanceCounter()/(1000*BarCount);
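A rough analogue of the quoted benchmark, sketched in plain Python
instead of AFL (the bar count and iteration count are made-up stand-ins,
and the absolute numbers will be far slower than the native loop; only
the measurement pattern is the same):

```python
import time

# Analogue of the AFL benchmark: time an elementwise array multiply
# and report seconds per single element ("bar").
BARCOUNT = 50_000      # "50K bars or more", as the text suggests
ITERATIONS = 100       # fewer than AFL's 1000; interpreted Python is slow

close = [float(i % 97) for i in range(BARCOUNT)]
high = [float(i % 89) for i in range(BARCOUNT)]

start = time.perf_counter()
for _ in range(ITERATIONS):
    x = [c * h for c, h in zip(close, high)]
stop = time.perf_counter()

per_element = (stop - start) / (ITERATIONS * BARCOUNT)
print(f"Time per single iteration [s] = {per_element:.3e}")
```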
>>
>> Only really complex operations that use *lots* of FPU (floating
>> point) cycles, such as trigonometric (sin/cos/tan) functions, are
>> slow enough for the memory to keep up.
>>
>> Of course one may say that I am using an "old" processor and new
>> computers have faster RAM, and that's true, but processor speeds
>> increase FASTER than bus speeds, and the gap between processor and
>> RAM becomes larger and larger, so with newer CPUs the situation
>> will be worse, not better.
>>
>>
>> Best regards,
>> Tomasz Janeczko
>> amibroker.com
>> ----- Original Message -----
>> From: "dloyer123" <dloyer123@x..>
>> To: <amibroker@xxxxxxxxxps.com>
>> Sent: Tuesday, May 13, 2008 5:02 PM
>> Subject: [amibroker] Re: Dual-core vs. quad-core
>>
>>
>> > All of the cores have to share the same front-side bus and
>> > northbridge.
>> > The northbridge connects the CPU to memory and has limited
>> > bandwidth.
>> >
>> > If several cores are running memory-hungry applications, the
>> > front bus will saturate.
>> >
>> > The L2 cache helps for most applications, but not if you are
>> > burning through a few GB of quote data. The L2 cache is just
>> > 4-8 MB.
>> >
>> > The newer multi-core systems have much faster front buses, and
>> > that trend is likely to continue.
>>
>> > So, it would be nice if AMI could support running on multiple
>> > cores, even if it was just running different optimization passes
>> > on different cores. That would saturate the front bus, but take
>> > advantage of all of the memory bandwidth you have. It would
>> > really help those multi-day walk-forward runs.
>> >
>> >
>> >
>> > --- In amibroker@xxxxxxxxxps.com, "markhoff" <markhoff@> wrote:
>> >>
>> >>
>> >> If you have a runtime penalty when running 2 independent AB
>> >> jobs on a Core Duo CPU, it might be caused by too little memory
>> >> (swapping to disk) or by other tasks which are also running
>> >> (e.g. a web browser, audio streamer or whatever). You can check
>> >> this with a process explorer which shows each task's CPU
>> >> utilisation. Similarly, 4 AB jobs on a Core Quad should have
>> >> nearly no penalty in runtime.
>> >>
>> >> Tomasz stated that multi-threaded optimization does not scale
>> >> well with the number of CPUs, but it is not clear to me why this
>> >> is the case. In my understanding, AA optimization is a
>> >> sequential process of running the same AFL script with different
>> >> parameters. If I have an AFL with a significantly long runtime
>> >> per optimization step (e.g. 1 minute), the overhead for the
>> >> multi-threading should become quite small, and independent tasks
>> >> should scale nearly with the number of CPUs (as long as there is
>> >> sufficient memory; n threads might need n times more memory than
>> >> a single thread). For sure the situation is different if my
>> >> single optimization run takes only a few millisecs or seconds;
>> >> then the overhead for multi-thread management goes up ...
>> >>
>> >> Maybe Tomasz can give some detailed comments on that issue?
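The scaling intuition in the quoted paragraph can be sketched with a
toy model (the 10 ms overhead is an assumed illustrative figure, not
AmiBroker's actual threading cost): with a fixed per-step overhead,
minute-long optimization steps scale almost linearly with core count,
while millisecond-scale steps do not.

```python
# Idealized speedup for N cores when each optimization step costs
# `step` seconds of real work plus a fixed `overhead` per step.
# Serial time: steps * step; parallel: (steps / cores) * (step + overhead).
def speedup(cores, step, overhead):
    return cores * step / (step + overhead)

OVERHEAD = 0.01  # 10 ms of thread management per step (assumed)

print(speedup(4, 60.0, OVERHEAD))    # 1-minute steps: very close to 4x
print(speedup(4, 0.005, OVERHEAD))   # 5-ms steps: well under 4x
```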
>> >>
>> >> Best regards,
>> >> Markus
> ------------------------------------
>
> Please note that this group is for discussion between users only.
>
> To get support from AmiBroker please send an e-mail directly to
> SUPPORT {at} amibroker.com
>
> For NEW RELEASE ANNOUNCEMENTS and other news always check DEVLOG:
> http://www.amibroker.com/devlog/
>
> For other support material please check also:
> http://www.amibroker.com/support.html
>
> Yahoo! Groups Links