Hello,
FYI: a SINGLE processor core running an AFL formula is able
to saturate memory bandwidth
in the majority of common operations/functions
if the total size of the arrays used in a given formula exceeds
the DATA cache size.
You need to understand that AFL runs at native assembly
speed
when using array operations.
A simple array multiplication like this:

X = Close * H; // array multiplication

gets compiled to just 8 assembly instructions:
loop:
00465064 8B 54 24 58   mov   edx, dword ptr [esp+58h]
00465068 46            inc   esi                       ; increase counters
00465069 83 C0 04      add   eax, 4
0046506C 3B F7         cmp   esi, edi
0046506E D9 44 B2 FC   fld   dword ptr [edx+esi*4-4]   ; get element of Close array
00465072 D8 4C 08 FC   fmul  dword ptr [eax+ecx-4]     ; multiply by element of High array
00465076 D9 58 FC      fstp  dword ptr [eax-4]         ; store result
00465079 7C E9         jl    loop                      ; continue until all elements are processed
As you can see, there are three 4-byte memory accesses per loop iteration
(two reads of 4 bytes each and one write of 4 bytes).
On my (2-year-old) 2 GHz Athlon 64 X2, a single iteration
of this loop takes 6 nanoseconds (see the benchmark code below).
So, every 6 nanoseconds we read 8 bytes and store 4 bytes.
That is 8/(6e-9) bytes per second = 1333 MB per second read
and 4/(6e-9) = 667 MB per second write, simultaneously, i.e. 2 GB/sec
combined!
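The arithmetic above can be checked with a few lines of Python (a back-of-the-envelope sketch; the 6 ns per iteration is the figure measured on the Athlon, not something the code derives):

```python
# Per loop iteration: two 4-byte reads (Close, High) and one 4-byte store (X).
iteration_time = 6e-9   # seconds per loop iteration (measured value from the post)
bytes_read     = 8      # 2 reads x 4 bytes
bytes_written  = 4      # 1 write x 4 bytes

read_bw  = bytes_read    / iteration_time   # bytes per second read
write_bw = bytes_written / iteration_time   # bytes per second written

print(round(read_bw  / 1e6))                 # ~1333 MB/s read
print(round(write_bw / 1e6))                 # ~667 MB/s write
print(round((read_bw + write_bw) / 1e9, 1))  # ~2.0 GB/s combined
```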
Now, if you look at memory benchmarks,
you will see that 2 GB/s is THE LIMIT of system memory
speed on the Athlon 64 (DDR2, dual channel).
And that is considering the fact that the Athlon has
a superior-to-Intel on-die integrated memory controller
(HyperTransport).
// benchmark code - for accurate results run it on LARGE arrays
// (intraday database, 1-minute interval, 50K bars or more)
GetPerformanceCounter( 1 );
for( k = 0; k < 1000; k++ )
    X = C * H;
"Time per single iteration [s] = " + 1e-3 * GetPerformanceCounter() / ( 1000 * BarCount );
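For readers without AmiBroker, the same measurement idea can be sketched in Python (illustrative only: Python's interpreter overhead means this mostly measures Python itself, whereas the AFL array loop runs as native code; the function name and sizes here are my own, not from the post):

```python
import time

def time_per_element(n_bars=50_000, reps=100):
    """Average time for one element of X = Close * High over reps full passes."""
    close = [1.0 + i for i in range(n_bars)]
    high  = [2.0 + i for i in range(n_bars)]
    start = time.perf_counter()
    for _ in range(reps):
        x = [c * h for c, h in zip(close, high)]  # same 2-read / 1-write pattern
    elapsed = time.perf_counter() - start
    return elapsed / (reps * n_bars)

print("time per single iteration [s] =", time_per_element())
```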
Only really complex operations that use *lots* of
FPU (floating-point) cycles,
such as trigonometric functions (sin/cos/tan), are slow
enough for the memory
to keep up.
Of course, one may say that I am using an "old" processor and that new
computers have faster RAM, and that's true,
but processor speeds increase FASTER than bus speeds, and the
gap between processor and RAM
grows larger and larger, so with newer CPUs the situation
will be worse, not better.
Best regards,
Tomasz Janeczko
amibroker.com
----- Original Message -----
Sent: Tuesday, May 13, 2008 5:02 PM
Subject: [amibroker] Re: Dual-core vs. quad-core
> All of the cores have to share the same front bus and northbridge.
> The northbridge connects the CPU to memory and has limited bandwidth.
>
> If several cores are running memory-hungry applications, the front bus
> will saturate.
>
> The L2 cache helps for most applications, but not if you are burning
> through a few GB of quote data. The L2 cache is just 4-8 MB.
>
> The newer multi-core systems have much faster front buses and that
> trend is likely to continue.
>
> So, it would be nice if AMI could support running multiple cores, even
> if it was just running different optimization passes on different
> cores. That would saturate the front bus, but take advantage of all
> of the memory bandwidth you have. It would really help those multi-day
> walkforward runs.
>
> --- In amibroker@xxxxxxxxxxxxxxx, "markhoff" <markhoff@xxx> wrote:
>>
>> If you have a runtime penalty when running 2 independent AB jobs on a
>> Core Duo CPU, it might be caused by too little memory (swapping to disk)
>> or by other tasks which are also running (e.g. a web browser, audio
>> streamer or whatever). You can check this with a process explorer
>> which shows each task's CPU utilisation. Similarly, 4 AB jobs on a Core
>> Quad should have nearly no penalty in runtime.
>>
>> Tomasz stated that multi-thread optimization does not scale well with
>> the CPU number, but it is not clear to me why this is the case. In my
>> understanding, AA optimization is a sequential process of running the
>> same AFL script with different parameters. If I have an AFL with a
>> significantly long runtime per optimization step (e.g. 1 minute), the
>> overhead for the multi-threading should become quite small, and
>> independent tasks should scale nearly with the number of CPUs (as long
>> as there is sufficient memory; n threads might need n times more memory
>> than a single thread). For sure the situation is different if my
>> single optimization run takes only a few milliseconds or seconds;
>> then the overhead for multi-thread management goes up ...
>>
>> Maybe Tomasz can give some detailed comments on that issue?
>>
>> Best regards,
>> Markus
>
> ------------------------------------
>
> Please note that this group is for discussion between users only.
>
> To get support from AmiBroker please send an e-mail directly to
> SUPPORT {at} amibroker.com
>
> For NEW RELEASE ANNOUNCEMENTS and other news always check DEVLOG:
> http://www.amibroker.com/devlog/
>
> For other support material please check also:
> http://www.amibroker.com/support.html