Fast Array Operations

tabular.fast is a module containing efficient NumPy algorithms for solving list-membership problems.

Motivation

NumPy arrays are typically used for “matrix math” operations (e.g. linear algebra), while list-membership queries are often handled in Python itself. For instance, consider the PathAlong problem: given two Python lists of file paths F1 and F2 (in reduced form), find all indices of paths in F1 that are “above”: in the file-system sense, paths in F2, together with the corresponding sets of contained paths in F2 for each path in F1. For example, if:

F1 = ['../Data/Dan_Data/', '../Users/DanYamins/','../Users/ElaineAngelino/']

F2 = ['../Data/Dan_Data/NPR_Puzzle_Solutions',  '../Data/Dan_Data/NPR_Puzzle_Solutions/AmericaPensacolaPuzzle/', '../Data/Dan_Data/RandomData','../Users/DanYamins/Finance/']

then find out that:

  • F1[0] is “path above” all items in F2[0:3]
  • F1[1] is “path above” all items in F2[3:4]
  • F2[2] is “path above” no items in F2.

A typical python solution to the PathAlong problem is:

def PathAlong(F1,F2):
        L = []
        for i in F1:
                Alongs = []
                for j in F2:
                        if j[:len(i)] == i and ((len(j) > len(i) and j[len(i)]=='/') or i[-1]=='/'):
                                Alongs.append(j)
                L.append((i,Alongs))
        return L

Though this solution works perfectly well, it is quite slow when F1 and F2 are both large lists.

It turns out that the PathAlong problem, along with many other typical list-membership query problems, can be solved with fast NumPy algorithms – as long as the data types of the lists involved are sufficiently uniform that the lists can be cast into NumPy arrays. The tabular.fast module contains a set of “building blocks” for such algorithms. The algorithms are especially useful for constructing efficient spread-sheet style operations on large tabular datasets.

Membership Testing

isin

tabular.fast.isin() is a fast routine for determining indices of elements in one numpy array that appear in another, returning a boolean array of those indices.

>>> import tabular.fast as fast
>>> import numpy, time
>>> X = numpy.random.randint(10000,size=(10000,))
>>> Y = numpy.random.randint(4000,size=(10000,))
>>> Z = fast.isin(X,Y)
>>> X[0:10]
array([3383, 8715, 4990, 3314, 3765, 1212, 1090, 8958, 1679,  862])
>>> Z[0:10]
array([ True, False, False,  True,  True,  True,  True, False,  True,  True], dtype=bool)

isin works not just on numerical arrays, but also on string arrays:

>>> L = numpy.array(['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z'])
>>> [V1,V2,V3] = [numpy.random.randint(26,size=(10000,)),numpy.random.randint(26,size=(10000,)),numpy.random.randint(26,size=(10000,))]
>>> H = numpy.array([''.join(a) for a in zip(L[V1],L[V2],L[V3])])
>>> print H[:10]
['xhq' 'dkh' 'gfe' 'uef' 'lbt' 'ped' 'lem' 'cyp' 'jli' 'wth']
>>> [V1,V2,V3] = [numpy.random.randint(26,size=(10000,)),numpy.random.randint(26,size=(10000,)),numpy.random.randint(26,size=(10000,))]
>>> H2 = numpy.array([''.join(a) for a in zip(L[V1],L[V2],L[V3])]
>>> print H2[:10]
['zhz' 'klb' 'qnr' 'noq' 'mzo' 'mzr' 'mjk' 'cmu' 'dnr' 'klb']
>>> t = time.time() ; Z = fast.isin(H,H2) ; time.time() - t
0.0076260566711425781
>>> t = time.time() ; T = [l in H2 for l in H] ; time.time() - t
2.4915561676025391
>>> (Z == numpy.array(T)).all()
True
>>> H[Z]
array(['tlz', 'ntm', 'mza', 'rtj', 'dfu', 'rxz', 'jas', 'aei', 'vnk',
       'lod', 'vse', 'xlt', 'klj', 'xht', 'trz', 'jun', 'upo', 'pfh',
       'nlp', 'ztl', 'wpy', 'oxn', 'zbc', 'klg', 'tdv', 'rys', 'qpn',
       'psk', 'vwz', 'wkp', 'gco', 'kak', 'cdc', 'lvg', 'dbd', 'flh',
       'css', 'tlz', 'vsy', 'xha', 'vpz', 'tdu', 'otj', 'wtz', 'qvc',
       'uud', 'iuu', 'cej', 'qyj', 'kxa', 'ixy', 'gag', 'mxo'],
      dtype='|S3')

recarrayisin

tabular.fast.recarrayisin() is the analog of isin for NumPy recarrays.

>>> L = numpy.array(['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z'])
>>> [V1,V2,V3] = [numpy.random.randint(26,size=(10000,)),numpy.random.randint(26,size=(10000,)),numpy.random.randint(26,size=(10000,))]
>>> H = numpy.rec.fromrecords([(numpy.random.randint(10),''.join(a)) for a in zip(L[V1],L[V2],L[V3])],names = ['Number','Letters'])
>>> [V1,V2,V3] = [numpy.random.randint(26,size=(10000,)),numpy.random.randint(26,size=(10000,)),numpy.random.randint(26,size=(10000,))]
>>> H2 = numpy.rec.fromrecords([(numpy.random.randint(10),''.join(a)) for a in zip(L[V1],L[V2],L[V3])],names = ['Number','Letters'])
>>> Z = fast.recarrayisin(H,H2)
>>> t = time.time() ; Z = fast.isin(H,H2) ; time.time() - t
0.25945591926574707
>>> t = time.time() ; T = [l in H2 for l in H] ; time.time() - t
4.2678370475769043
>>> (Z == numpy.array(T)).all()
True
>>> print H[Z][:10]
[(4, 'xpc') (7, 'umr') (1, 'gmn') (6, 'qub') (6, 'fdz') (6, 'nht')
 (5, 'uts') (2, 'nlt') (6, 'xzv') (3, 'jae')]

Equality Testing

equalspairs

Given one NumPy array X and another sorted numpy array Y, tabular.fast.equalspairs() determines the indices in Y which are equal to indices in X.

>>> N = 10000
>>> Y = numpy.random.randint(0,10,size=(N,))
>>> Y.sort()
>>> X = numpy.random.randint(0,10,size=(N,))
>>> t = time.time() ; [A,B] = equalspairs(X,Y) ; print time.time() - t
0.00613594055176

For each i = 0, ...., N, the values A[i] and B[i] represent, respectively, the positions in B of the first and last appearances of the value X[i], so that:

Y[A[i]:B[i]] = Y[Y == X[i]]

For comparison, a (purer) Python approach to this computation:

>>> t = time.time() ; C = numpy.array([min((Y == k).nonzero()[0]) for k in X]) ; print time.time() - t
3.72342395782
>>> t = time.time() ; D = numpy.array([1 + max((Y == k).nonzero()[0]) for k in X]) ; time.time() - t
3.7644798755645752
>>> (A == C).all() and (B == D).all()
True

recarrayequalspairs

tabular.fast.recarrayequalspairs() is analogous to tabular.fast.equalspairs(), but works on recarrays. It is slightly different, of course, because the concept of being sorted is less well-defined for a record array. Specifically, on record arrays X and Y, tabular.fast.recarrayequalspairs() returns a triple:

[A,B,s] = fast.recarraypairs(X,Y)

where s is a permutation of Y such that for

Z = Y[s]

we have

Z[A[i]:B[i]] = Z[Z == X[i]]

“Uniqification”

It is often useful to obtain the unique instance of elements in an array.

The function tabular.utils.uniqify() is a fast way to do this, with the additional nice property that it retains the order of the original array.

>>> A = numpy.random.randint(100000,size = (1000000,))
>>> t = time.time() ; B = numpy.array(tb.uniqify(A)) ; print time.time() - t
0.492871046066

If retaining order is not important, the function tabular.fast.arrayuniqify() is even faster.

>>> t = time.time() ; [D,s] = fast.arrayuniqify(A) ; print time.time() - t
0.157983064651
>>> C = A[s][D]
>>> (numpy.sort(B) == C).all()
True

tabular.fast.recarrayuniqify() is a record-array version of tabular.fast.arrayuniqify().

Solution to PathAlong

Let’s return to the problem mentioned at the beginning of this section. Suppose we have two lists of file paths, and we want to know which elements of the first list are “above” elements of the second list. The following algorithm uses tabular.fast.isin() to solve this problem efficiently, the basic strategy being to break up the problem into a set of problems, one for each level of the file hierarchy:

def getpathalong(Y,Z):

        Z = Z.copy() ; Z.sort()
        Y = numpy.array([y + '/' if y[-1] != '/' else y for y in Y])
        Z = numpy.array([y + '/' if y[-1] != '/' else y for y in Z])
        SlashList = numpy.array([len(y.split('/')) - 1 for y in Y])
        Max = max(SlashList) ;  Min = min(SlashList)
        W = numpy.zeros(len(Y),int)

        for i in range(Min,Max+1):
                T = numpy.array(['/'.join(z.split('/')[:i]) + ('/' if len(z.split('/')) > i else '')  for z in Z ])
                L = SlashList == i
                W[L] = fast.isin(Y[L],T)

        return W
>>> F1 = numpy.array(['../Data/Dan_Data/', '../Users/DanYamins/','../Users/ElaineAngelino/','../Data'])
>>> F2 = numpy.array(['../Data/Dan_Data/NPR_Puzzle_Solutions','../Data/Dan_Data/NPR_Puzzle_Solutions/AmericaPensacolaPuzzle/', '../Data/Dan_Data/RandomData','../Users/DanYamins/Finance/'])
>>> test.getpathalong(F1,F2)
array([1, 1, 0, 1])

Of course, it is useful sometimes not only to have which elements of the first list are “above” elements of the second but also the indices in the second list of items they’re above. This algorithm solves that problem:

def getpathalongpairs(Y,Z):

        s = Z.argsort() ; Z = Z[s]
        Y = numpy.array([y + '/' if y[-1] != '/' else y for y in Y])
        Z = numpy.array([y + '/' if y[-1] != '/' else y for y in Z])
        SlashList = numpy.array([len(y.split('/')) - 1 for y in Y])
        Max = max(SlashList) ;  Min = min(SlashList)
        W = numpy.zeros(len(Y),int) ; U = numpy.zeros(len(Y),int)

        for i in range(Min,Max+1):
                T = numpy.array(['/'.join(z.split('/')[:i]) + ('/' if len(z.split('/')) > i else '')  for z in Z ])
                L = SlashList == i
                [W[L],U[L]] = fast.equalspairs(Y[L], T)
        return [W,U,s]

Using the same test data as above:

>>> [A,B,s] = test.getpathalongpairs(F1,F2)
>>> A
array([0, 3, 0, 0])
>>> B
array([0, 3, 4, 0])

Finally, let’s solve a seemingly more difficult problem. Suppose we have a list of pairs of file paths, like so:

>>> Recs = [('Data/Sports/Analysis/Raw1','Data/Sports/Analysis/Computations/'), ('Data/Sports/Analysis/Raw1','Data/Sports/Analysis/Computations/Analysis1.csv'),('Data/Sports/Analysis/Raw1','Data/Sports/Analysis/Computations/Analysis2.csv'),('Data/Sports/Analysis/Raw2','Data/Sports/Analysis/Computations/Analysis3.csv'),('Data/Sports/Analysis/Raw2','Data/Sports/Analysis/Computations/')]
>>> Recs.sort()
>>> Data = numpy.rec.fromrecords(Recs,names=['Input','Output'])

Now, suppose what we want to know is: which records in Data[‘Output’] are strictly above, in the file-system sense, other recods in the same column, but have the save value of Data[‘Input’]? Here is an efficient solution, using what we’ve just learned:

def PathPairSolution(R):

        RR = numpy.array([x + ' ; ' + y for (x,y) in R])
        [A,B,s] = getpathalongpairs(RR,RR)
        [C,D] = fast.equalspairs(RR,RR)

        return zip(D,B)
>>> PathPairSolution(Data)
[(1, 3), (2, 2), (3, 3), (4, 5), (5, 5)]

Table Of Contents

Previous topic

Column hierarchies and the coloring attribute

Next topic

Web & HTML

This Page