QuickStart

Import tabular like this:

>>> import tabular as tb

Making Tabular Arrays

The class tabular.tabarray.tabarray is a data structure designed to hold and manipulate tabular data. tabarray objects are always created like this:

X = tb.tabarray(some data...)

Calling tb.tabarray always returns an instance of the tabular.tabarray.tabarray class, and such tabarray objects can be created with a variety of types of input data.

The basic idea is that your input data will often be a native Python object (like a list of tuples) or a text file (like a CSV). By putting this data into a tabarray object, you’ll get a representation of the data that has powerful row and column selection, filtering, analysis, and manipulation properties with a very simple syntax.

Row data: For instance, suppose your data is a Python list of tuples, each element of which represents a row:

>>> Recs = [('North', 'Service', 44.497368897699261, 85937),
 ('North', 'Manufacturing', 88.752549661587452, 305),
 ('North', 'Education', 46.304921839816174, 90568),
 ('North', 'Healthcare', 12.428147250824761, 52493),
 ('South', 'Service', 7.3632321953949864, 85280),
 ('South', 'Manufacturing', 2.35086845930782, 95578),
 ('South', 'Education', 53.272407154429423, 11823),
 ('South', 'Healthcare', 56.68070233253686, 32669),
 ('East', 'Service', 5.9745703176360525, 81417),
 ('East', 'Manufacturing', 93.840942619447858, 26229),
 ('East', 'Education', 78.067752261015244, 62454),
 ('East', 'Healthcare', 19.703894654017773, 63346),
 ('West', 'Service', 32.015715993438917, 72732),
 ('West', 'Manufacturing', 10.914876835171071, 23455),
 ('West', 'Education', 41.714531295305711, 85876),
 ('West', 'Healthcare', 81.458571751676629, 43660)]

Put the data into a tabular array with column names ‘Region’, ‘Sector’, ‘Amount’, ‘Population’, using the “records” and “names” keyword arguments:

>>> X = tb.tabarray(records = Recs, names=['Region','Sector','Amount','Population'])
>>> X
tabarray([('North', 'Service', 44.497368897699261, 85937),
       ('North', 'Manufacturing', 88.752549661587452, 305),
       ('North', 'Education', 46.304921839816174, 90568),
       ('North', 'Healthcare', 12.428147250824761, 52493),
        .  .  .
       ('West', 'Manufacturing', 10.914876835171071, 23455),
       ('West', 'Education', 41.714531295305711, 85876),
       ('West', 'Healthcare', 81.458571751676629, 43660)],
      dtype=[('Region', '|S5'), ('Sector', '|S13'), ('Amount', '<f8'), ('Population', '<i4')])

Column data: If your data is a list of columns, like this:

>>> Cols = [('North', 'North', 'North', 'North', 'South', 'South', 'South', 'South',
 'East', 'East', 'East', 'East', 'West', 'West', 'West', 'West'),
 ('Service', 'Manufacturing', 'Education', 'Healthcare', 'Service',
 'Manufacturing', 'Education', 'Healthcare', 'Service', 'Manufacturing',
 'Education', 'Healthcare', 'Service', 'Manufacturing', 'Education', 'Healthcare'),
 (81.749690102163314, 74.023848861551798, 32.515259350801131, 49.606447425664648,
 11.748647385553801, 37.258888460591535, 34.364104643975423, 7.0963777523329608,
 75.359680733433933, 46.357633526256734, 68.819758428065498, 67.493509469400962,
 19.816892327675639, 55.942915632272758, 18.294195106203382, 55.55884722919324),
 (99896, 83758, 9808, 60760, 60957, 99498, 74334, 66600, 62984, 5519,
 96958, 7764, 2987, 88673, 7026, 32438)]

Then use the “columns” keyword argument to create a tabular array:

>>> X = tb.tabarray(columns = Cols, names=['Region','Sector','Amount','Population'])
>>> X
tabarray([('North', 'Service', 44.497368897699261, 85937),
       ('South', 'Service', 7.3632321953949864, 85280),
       ('East', 'Service', 5.9745703176360525, 81417),
       ('West', 'Service', 32.015715993438917, 72732)],
      dtype=[('Region', '|S5'), ('Sector', '|S7'), ('Amount', '<f8'), ('Population', '<i4')])

Tabular arrays can also be created from NumPy arrays by using the “array” keyword argument, and a variety of other ways (see Creating Tabular Arrays for more details.)

Length and data type

The length of a tabular array is its number of records:

>>> len(X)                                                      #number of records
16

Tabarrays have associated dtype objects, containing the column names and the data type of each column:

>>> X.dtype
dtype([('Region', '|S5'), ('Sector', '|S13'), ('Amount', '<f8'), ('Population', '<i4')])
>>> len(X.dtype)                                                #number of columns
4
>>> X.dtype.names                                               #column names
('Region', 'Sector', 'Amount', 'Population')

Selection and Filtering

Columns: Individual columns or groups of columns can be selected:

>>> X['Amount']                                                 #select 'Amount' column
array([ 44.4973689 ,  88.75254966,  46.30492184,  12.42814725,
         7.3632322 ,   2.35086846,  53.27240715,  56.68070233,
         5.97457032,  93.84094262,  78.06775226,  19.70389465,
        32.01571599,  10.91487684,  41.7145313 ,  81.45857175])
>>> X[['Amount','Population']]                                  #select both 'Amount' and 'Population' columns
tabarray([(44.497368897699261, 85937), (88.752549661587452, 305), (46.304921839816174, 90568),
       .  .  .
       (32.015715993438917, 72732), (10.914876835171071, 23455),
       (41.714531295305711, 85876), (81.458571751676629, 43660)],
      dtype=[('Amount', '<f8'), ('Population', '<i4')])

Rows: Individual rows or sets of rows can be selected:

>>> X[0]                                                        #select row 0
 ('North', 'Service', 44.497368897699261, 85937)
>>> X[4:9]                                                      #select rows 4 through 8
[('South', 'Service', 7.3632321953949864, 85280)
 ('South', 'Manufacturing', 2.35086845930782, 95578)
 ('South', 'Education', 53.272407154429423, 11823)
 ('South', 'Healthcare', 56.68070233253686, 32669)
 ('East', 'Service', 5.9745703176360525, 81417)]
>>> X[[1,5,9,14]]                                               #select rows 1,5,9,and 14
[('North', 'Manufacturing', 88.752549661587452, 305)
 ('South', 'Manufacturing', 2.35086845930782, 95578)
 ('East', 'Manufacturing', 93.840942619447858, 26229)
 ('West', 'Education', 41.714531295305711, 85876)]

Comparisons: Rich, row-by-row comparisons can be made for columns, returning boolean arrays.

>>> X['Sector'] == 'Manufacturing'                              #compare to a scalar constant value
array([False,  True, False, False, False,  True, False, False, False,
        True, False, False, False,  True, False, False], dtype=bool)
>>> X['Amount'] > .0001*X['Population']                         #compare one column to another array of the same length
array([ True,  True,  True,  True, False, False,  True,  True, False,
        True,  True,  True,  True,  True,  True,  True], dtype=bool)

Filtering: The resulting boolean arrays can be used to select rows:

>>> print X[X['Sector'] == 'Manufacturing']                     #get rows where "Sector" entry is "Manufacturing"
[('North', 'Manufacturing', 88.752549661587452, 305)
 ('South', 'Manufacturing', 2.35086845930782, 95578)
 ('East', 'Manufacturing', 93.840942619447858, 26229)
 ('West', 'Manufacturing', 10.914876835171071, 23455)]
>>> print X[X['Amount'] > 50]                                   #get rows where "Amount" entry exceeds 50
[('North', 'Manufacturing', 88.752549661587452, 305)
 ('South', 'Education', 53.272407154429423, 11823)
 ('South', 'Healthcare', 56.68070233253686, 32669)
 ('East', 'Manufacturing', 93.840942619447858, 26229)
 ('East', 'Education', 78.067752261015244, 62454)
 ('West', 'Healthcare', 81.458571751676629, 43660)]

Logical connectives ‘AND’ (&) and ‘OR’ (|) can be used to combine boolean arrays for more sophisticated filtering:

>>> X[(X['Amount'] > 50) & (X['Sector'] == 'Manufacturing')]    #select rows where "Amount" > 50 and "Sector" = "Manufacturing"
[('North', 'Manufacturing', 88.752549661587452, 305)
 ('East', 'Manufacturing', 93.840942619447858, 26229)]
>>> X[((X['Amount'] > 50) & (X['Sector'] == 'Manufacturing')) | (X['Population'] > 90000)]
[('North', 'Manufacturing', 88.752549661587452, 305)
 ('North', 'Education', 46.304921839816174, 90568)
 ('South', 'Manufacturing', 2.35086845930782, 95578)
 ('East', 'Manufacturing', 93.840942619447858, 26229)]

Reassignment: The contents of a tabarray can be changed by assignment, both for individual values:

>>> print X['Amount'][0]
44.497
>>> X['Amount'][0] = 51.004
>>> print X['Amount']
51.004

as well as whole columns:

>>> X['Amount'] = X['Amount'] + 10
>>> X['Amount']
array([  61.004     ,   98.75254966,   56.30492184,   22.42814725,
         17.3632322 ,   12.35086846,   63.27240715,   66.68070233,
         15.97457032,  103.84094262,   88.06775226,   29.70389465,
         42.01571599,   20.91487684,   51.7145313 ,   91.45857175])

Row and column selection features can be combined:

>>> X['Amount'][X['Sector'] == 'Manufacturing']                 #get "Amount" entries for rows where "Sector" entry is "Manufacturing"
array([  98.75254966,   12.35086846,  103.84094262,   20.91487684])

and also combined with assignments:

>>> X['Amount'][X['Sector'] == 'Manufacturing'] = X['Amount'][X['Sector'] == 'Manufacturing'] - 10              #add 10 to all values selected above
>>> X['Amount']
array([ 61.004     ,  88.75254966,  56.30492184,  22.42814725,
        17.3632322 ,   2.35086846,  63.27240715,  66.68070233,
        15.97457032,  93.84094262,  88.06775226,  29.70389465,
        42.01571599,  10.91487684,  51.7145313 ,  91.45857175])

For more information about selection and filtering, see Manipulating Tabular Arrays: Indexing and Filtering.

Matrix Math

Numerical columns can be “extracted” into arrays on which NumPy mathematical operations can be performed:

>>> X[['Amount','Population']].extract()
array([[  6.10040000e+01,   8.59370000e+04],
       [  8.87525497e+01,   3.05000000e+02],
       [  5.63049218e+01,   9.05680000e+04],
       [  2.24281473e+01,   5.24930000e+04],
       [  1.73632322e+01,   8.52800000e+04],
       [  2.35086846e+00,   9.55780000e+04],
       [  6.32724072e+01,   1.18230000e+04],
       [  6.66807023e+01,   3.26690000e+04],
       [  1.59745703e+01,   8.14170000e+04],
       [  9.38409426e+01,   2.62290000e+04],
       [  8.80677523e+01,   6.24540000e+04],
       [  2.97038947e+01,   6.33460000e+04],
       [  4.20157160e+01,   7.27320000e+04],
       [  1.09148768e+01,   2.34550000e+04],
       [  5.17145313e+01,   8.58760000e+04],
       [  9.14585718e+01,   4.36600000e+04]])
>>> print 3*X[['Amount','Population']].extract().transpose() + 2
[[  1.85012000e+02   2.68257649e+02   1.70914766e+02   6.92844418e+01
    5.40896966e+01   9.05260538e+00   1.91817221e+02   2.02042107e+02
    4.99237110e+01   2.83522828e+02   2.66203257e+02   9.11116840e+01
    1.28047148e+02   3.47446305e+01   1.57143594e+02   2.76375715e+02]
 [  2.57813000e+05   9.17000000e+02   2.71706000e+05   1.57481000e+05
    2.55842000e+05   2.86736000e+05   3.54710000e+04   9.80090000e+04
    2.44253000e+05   7.86890000e+04   1.87364000e+05   1.90040000e+05
    2.18198000e+05   7.03670000e+04   2.57630000e+05   1.30982000e+05]]

There are a huge number of mathematical operations provided by NumPy (see Numpy Functions for a partial listing).

Spreadsheet-Style Operations

Tabarrays support a variety of “spreadsheet-style” operations, including row/column manipulations, replacements, aggregation, pivoting and joins.

Row & Column Operations:

>>> Y = X.deletecols(['Population'])                            #create new array with "Population" column deleted
>>> Y.dtype.names
('Region', 'Sector', 'Amount')
>>> Y = X.addrecords(('West','Surfing',10,10000))               #create new array with additional row
>>> print Y[-2:]
[('West', 'Healthcare', 91.458571751676629, 43660)
 ('West', 'Surfing', 10.0, 10000)]
>>> Importance = numpy.random.randint(10,size=(16,))
>>> Y = X.addcols([Importance],names=['Importance'])            #create new array with new "Importance" column
>>> Y.dtype.names
('Region', 'Sector', 'Amount', 'Population', 'Importance')
>>> Y['Importance']
array([3, 9, 1, 3, 4, 8, 0, 9, 1, 7, 6, 6, 7, 4, 3, 6])

The “rowstack” method concatenates arrays with identical column names and data types along the “vertical” (row) axis:

>>> Y = X.rowstack([X,X,X])                                     #create new array by concatenating X to itself 3 times
>>> len(Y)
64
>>> Y[0]==Y[16]==Y[32]==Y[48]
True

The “colstack” method concatenates arrays with disjoint column names but identical lengths along the “horizontal” (column) axis:

>>> X[['Region','Sector']].colstack([X[['Amount']]])
tabarray([('North', 'Service', 11.568675158203657),
       ('North', 'Manufacturing', 78.486819271428658),
       ('North', 'Education', 57.60705040663634),
       ('North', 'Healthcare', 41.550376687594138),
       .  .  .
       ('West', 'Manufacturing', 54.932402758880414),
       ('West', 'Education', 68.727826108077394),
       ('West', 'Healthcare', 81.726623074753874)],
      dtype=[('Region', '|S5'), ('Sector', '|S13'), ('Amount', '<f8')])

For more information, see Manipulating Tabular Arrays: Row and Column Operations.

sort: The “sort” method provides fast in-place spreadsheet style multi-column sorting. (Manipulating Tabular Arrays: Sorting.)

>>> X.sort(order=['Region','Sector'])
>>> X
[('East', 'Education', 23.097478581186881, 78073)
 ('East', 'Healthcare', 10.52554515403321, 6751)
 ('East', 'Manufacturing', 87.72136064848614, 53744)
 ('East', 'Service', 54.896677290366689, 32747)
 ('North', 'Education', 57.60705040663634, 53327)
 ('North', 'Healthcare', 41.550376687594138, 7896)
 ('North', 'Manufacturing', 78.486819271428658, 39440)
 ('North', 'Service', 11.568675158203657, 80672)
 ('South', 'Education', 31.402298595542256, 39531)
 ('South', 'Healthcare', 79.898317595005963, 70297)
 ('South', 'Manufacturing', 84.012379396498261, 77165)
 ('South', 'Service', 62.1631456237237, 70377)
 ('West', 'Education', 68.727826108077394, 30589)
 ('West', 'Healthcare', 81.726623074753874, 63049)
 ('West', 'Manufacturing', 54.932402758880414, 84733)
 ('West', 'Service', 55.281648099477955, 98990)]

replace: The “replace” method does fast in-place replacements. (Replace.)

>>> X.replace('Education','Taxes')
>>> print X
[('North', 'Service', 20.67236621081404, 54249)
 ('North', 'Manufacturing', 32.000394550879761, 24839)
 ('North', 'Taxes', 5.8365701185767671, 97360)
  .  .  .
 ('West', 'Service', 85.240242206812582, 88514)
 ('West', 'Manufacturing', 51.78347036293308, 37708)
 ('West', 'Taxes', 76.319284382925062, 97475)
 ('West', 'Healthcare', 24.449584044579531, 53317)]

aggregate: The “aggregate” method collapses a data set on unique values in columns specified by the “On” argument, using specified aggregation functions along the other columns (default is to sum).

>>> X.aggregate(On=['Region'], AggFuncDict={'Sector':lambda x : 'AllSectors'})  #create array containing sums of 'Amount' and 'Population' in each region
[('East', 'AllSectors', 278.57970406902308, 323710)
 ('North', 'AllSectors', 113.09165272461399, 194045)
 ('South', 'AllSectors', 329.83407710678244, 182472)
 ('West', 'AllSectors', 237.79258099725024, 277014)]

See Manipulating Tabular Arrays: Aggregation for more information.

pivot: The “pivot” method creates a pivot table relative to the specified column name arguments:

>>> Y = X.pivot('Region','Sector')
>>> print Y.dtype.names
('Region', 'Education_Amount', 'Healthcare_Amount', 'Manufacturing_Amount', 'Service_Amount', 'Education_Population', 'Healthcare_Population', 'Manufacturing_Population', 'Service_Population')
>>> Y[['Region','Healthcare_Amount','Healthcare_Population']]
[('East', 67.330421474002065, 84503) ('North', 54.582321844343419, 17597)
 ('South', 96.879266479772284, 78169) ('West', 24.449584044579531, 53317)]

See Manipulating Tabular Arrays: Pivot for more information.

join: The “join” method connects two or more tabarrays that share common “key” column(s):

>>> X1 = X[['Region','Sector','Amount']]
>>> X2 = X[['Region','Sector','Population']]
>>> X1.join(X2)
Inferring keycols to be: ['Region', 'Sector']
tabarray([('East', 'Education', 27.300913534209194, 57645),
       ('East', 'Healthcare', 79.315579226927042, 78858),
       ('East', 'Manufacturing', 11.550256098748601, 55331),
       ('East', 'Service', 79.895071543085521, 47919),
       ('North', 'Education', 74.546045755282648, 79467),
       ('North', 'Healthcare', 45.973679749219876, 15366),
       ('North', 'Manufacturing', 43.03577699412331, 46687),
       ('North', 'Service', 74.066922573923179, 41929),
       ('South', 'Education', 44.298421981780493, 8654),
       ('South', 'Healthcare', 2.5495475747373764, 9317),
       ('South', 'Manufacturing', 96.868543671291746, 72482),
       ('South', 'Service', 33.599182645022822, 25850),
       ('West', 'Education', 11.391185565781647, 62572),
       ('West', 'Healthcare', 56.204438615569586, 51704),
       ('West', 'Manufacturing', 67.868100963799833, 81922),
       ('West', 'Service', 48.381200406027581, 21577)],
      dtype=[('Region', '|S5'), ('Sector', '|S13'), ('Amount', '<f8'), ('Population', '<i4')])

See Manipulating Tabular Arrays: Join for more information.

Saving to / reading from files

Tabular arrays can be saved to various file formats, like Comma-Separated-Value (CSV):

>>> X.saveSV('SampleData.csv')

This writes out the text file ‘SampleData.csv’ with content:

#types=str,str,float,int
#names=Region,Sector,Amount,Population
North,Service,61.003999999999998,85937
North,Manufacturing,88.752549661587452,305
North,Education,56.304921839816174,90568
North,Healthcare,22.428147250824761,52493
South,Service,17.363232195394986,85280
South,Manufacturing,2.3508684593078204,95578
South,Education,63.272407154429423,11823
South,Healthcare,66.680702332536868,32669
East,Service,15.974570317636053,81417
East,Manufacturing,93.840942619447858,26229
East,Education,88.067752261015244,62454
East,Healthcare,29.703894654017773,63346
West,Service,42.015715993438917,72732
West,Manufacturing,10.914876835171071,23455
West,Education,51.714531295305711,85876
West,Healthcare,91.458571751676629,43660

Conversely, tabular arrays can be loaded in from files:

>>> Y = tb.tabarray(SV='SampleData.csv')
>>> all(Y == X)
True

See Input and Output for more information, including on other formats.

Table Of Contents

Previous topic

Tabular

Next topic

The Organization of the Tabular Package

This Page