On Mar 28, 8:25pm, Mark Johnson wrote:
> Subject: shaving microseconds (avoiding N * logN behavior)
> I've got a system that does an EXPENSIVE calculation
> and right now I'm using an algorithm with O(N * logN)
> asymptotic behavior. I hope somebody can offer a
> suggestion on how to reduce this to O(N), or better
> yet, O(logN).
>
> Problem: for each day in the dataset, look back at
> the N most recent trading days and return the
> K'th largest Open price among those N bars.
>
> Notice that N and K are variables ("Inputs" actually);
> N and K are *not* constant. It is expected that users
> will operate with N =approx= 120 and with K =approx= 50
> or thereabouts. But it is mandatory to allow users
> to modify N and K. We cannot "hardwire" them to
> fixed constant values.
Mark, this process is known as 'selection', and is
described in Sedgewick's "Algorithms" (2nd ed.)
on pp. 126-130. I've attached a C implementation
below, which uses recursion. Sedgewick's algorithm
on p. 128 is more compact, in Pascal, and is
non-recursive. The algorithm is a partitioning
process similar to the one used in Quicksort, and
Sedgewick says it is O(N) (i.e., it runs in linear
time on average). The catch is that it doesn't
completely sort the array, so deleting the oldest
element when the window slides forward becomes an
issue; that would best be solved with an 'indirect'
array of pointers recording where each of the N
most recent elements landed in the array.
Since 'k' is fixed once you start calling this
function, it would probably be possible to cache
the most recent values of the bookkeeping variables
so they only need adjusting when the oldest item is
deleted and a new item is added. That is probably
the way to go. (A sketch of a simple rolling-window
driver appears after the code below.)
Alternatively, a balanced binary tree, using
array indexes rather than pointers, would give
you roughly log2(N) performance, but the algorithm
would have to be modified to keep track of how
many elements are in the left and right subtrees
of each node, rather than simply recording the
+/- balance factor. This is probably the fastest
approach, but not for the faint of heart.
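To make the subtree-count idea concrete, here is a
minimal sketch of the rank-selection walk over such
an augmented tree (shown with pointers for
readability; the same walk works with array
indexes). The node layout and the 'lcount' field
are my own naming, and the rebalancing code, which
would also have to maintain the counts through
every rotation, is omitted.

/*
 * Hypothetical node layout for a count-augmented
 * balanced tree. 'lcount' is the number of nodes in
 * the left subtree; the balancing code (AVL,
 * red-black, ...) must keep it correct through
 * every insert, delete, and rotation.
 */
struct node
{
    int key;                    /* the Open price        */
    int lcount;                 /* size of left subtree  */
    struct node *left;
    struct node *right;
};

/*
 * Return the k-th smallest key (k is 1-based), or -1
 * if k is out of range. O(log N) on a balanced tree.
 */
int
select_kth (struct node *root, int k)
{
    while (root != 0)
    {
        if (k <= root->lcount)
            root = root->left;      /* k-th is in left subtree   */
        else if (k == root->lcount + 1)
            return root->key;       /* this node is the k-th     */
        else
        {
            k -= root->lcount + 1;  /* skip left subtree + node  */
            root = root->right;
        }
    }
    return -1;
}

And here is the recursive selection code mentioned
above: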
/*
 * Return the k-th smallest element of data[0..n-1]
 * (k is 1-based). Note that this permutes data[],
 * so callers should pass a scratch copy.
 */
int
smallk (int data[], int n, int k)
{
    int first = data[0];
    int i;
    int this;
    int nlt, nle, ngt;

    /*
     * find everything less than first,
     * and place it at the beginning of 'data'
     */
    nlt = 0;
    for (i = 1; i < n; ++i)
    {
        this = data[i];
        if (this < first)
        {
            if (i > nlt)
            {
                data[i] = data[nlt];
                data[nlt] = this;
            }
            ++nlt;
        }
    }
    /* at this point data[0..(nlt-1)] has all entries less than first */
    if (k <= nlt)
    {
        /* case a: k is less than or equal to 'nlt', so the k-th
         * smallest must be in the first 'nlt' items. No need to do
         * the other steps, just call smallk passing the first 'nlt'
         * items.
         */
        return smallk (data, nlt, k);
    }

    /*
     * next find all entries equal to first, and place them after
     * the group that is less. The first element in the equal group
     * begins at data[nlt].
     */
    nle = nlt;
    for (i = nlt; i < n; ++i)
    {
        this = data[i];
        if (this == first)
        {
            if (i > nle)
            {
                data[i] = data[nle];
                data[nle] = this;
            }
            ++nle;
        }
    }
    /*
     * by definition, the remaining entries are greater than first,
     * and we can calculate how many there are by subtraction.
     */
    ngt = n - nle;
    if (k <= nle)
    {
        /* case b: k is less than or equal to 'nle'. In this case
         * the k-th smallest has been found, and it is equal to
         * 'first'.
         */
        return first;
    }

    /* at this point,
     * data[0..(nlt-1)] has all entries less than first
     * data[nlt..(nle-1)] has all entries equal to first
     * data[nle..(n-1)] has all entries greater than first
     *
     * case c: call smallk recursively passing the elements greater
     * than first. Reduce the value of 'k' by 'nle', because we've
     * determined that many were <= first
     */
    return smallk (&data[nle], ngt, k - nle);
}
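For completeness, here is one way the above could be
driven bar by bar. Since the original problem asks
for the K-th *largest* of the last N Opens, and
smallk() returns the k-th *smallest*, pass
(N - K + 1) as the rank. The function name, the
fixed-size scratch buffer, and the use of int prices
are all just a sketch of the idea, not tested code.

#include <string.h>

/*
 * Sketch: K-th largest of the N most recent Opens.
 * 'opens' holds the Opens up to and including the
 * current bar; 'bar' is the current bar's index
 * (bar >= n - 1). smallk() permutes its input, so
 * we work on a scratch copy of the window.
 */
int
kth_largest_open (const int opens[], int bar, int n, int k)
{
    int window[512];            /* assumes n <= 512 */

    memcpy (window, &opens[bar - n + 1], n * sizeof (int));

    /* k-th largest of n items == (n - k + 1)-th smallest */
    return smallk (window, n, n - k + 1);
}

Each bar then costs O(N) for the copy plus O(N)
average for smallk(), which already beats re-sorting
the window at O(N log N), though it falls short of
the O(log N) per bar the tree approach would give.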