tabular.spreadsheet

Spreadsheet-style functions for NumPy ndarray with structured dtype or recarray objects:

aggregate, aggregate_in, pivot, addrecords, addcols, deletecols, renamecol, replace, colstack, rowstack, nullvalue

Note that these functions are also wrapped as methods of the tabular tabarray object, which is a subclass of the numpy ndarray.

See Also:

tabular.spreadsheet.aggregate(X, On=None, AggFuncDict=None, AggFunc=None, AggList=None, returnsort=False, KeepOthers=True)

Aggregate a ndarray with structured dtype (or recarray) on columns for given functions.

Aggregate a numpy recarray (or tabular tabarray) on a set of specified factors, using specified aggregation functions.

Intuitively, this function will aggregate the dataset X on a set of columns, whose names are listed in On, so that the resulting aggregate data set has one record for each unique tuples of values in those columns.

The more factors listed in On argument, the “finer” is the aggregation, the fewer factors, the “coarser” the aggregation. For example, if:

On = ['A','B']

the resulting data set will have one record for each unique value of pairs (a,b) in:

X[['A','B']]

The AggFuncDict argument specifies how to aggregate the factors _not_ listed in On, e.g. the so-called Off columns. For example, if

On = [‘A’,’B’]

and C is some other column, then:

AggFuncDict['C']

is the function that will be used to reduce to a single value the (potentially multiple) values in the C column corresponding to unique values in the A, B columns. For instance, if:

AggFuncDict['C'] = numpy.mean

then the result will be that the values in the C column corresponding to a single A, B value will be averaged.

If an Off column is _not_ provided as a key in AggFuncDict, a default aggregator function will be used: the sum function for numerical columns, concatenation for string columns.

Implemented by the tabarray method tabular.tab.tabarray.aggregate().

Parameters

X : numpy ndarray with structured dtype or recarray

The data set to aggregate.

On : list of strings, optional

List of column names in X.

AggFuncDict : dictionary, optional

Dictionary where

  • keys are some (all) column names of X that are NOT in On
  • values are functions that can be applied to lists or numpy arrays.

This specifies how to aggregate the factors _not_ listed in On, e.g. the so-called Off columns.

AggFunc : function, optional

Function that can be applied to lists or numpy arrays, specifying how to aggregate factors not listed in either On or the keys of AggFuncDict, e.g. a “default” aggregation function for the Off columns not explicitly listed in AggFuncDict.

returnsort : Boolean, optional

If returnsort == True, then return a list of indices describing how X was sorted as a result of aggregation. Default value is False.

Returns

agg : numpy ndarray with structured dtype

Aggregated data set.

index_array : numpy ndarray (int, 1D)

Returned only if returnsort == True. List of indices describing how X was sorted as a result of aggregation.

See also:

tabular.spreadsheet.aggregate_in(Data, On=None, AggFuncDict=None, AggFunc=None, AggList=None, interspersed=True)

Aggregate a ndarray with structured dtype or recarray and include original data in the result.

Take aggregate of data set on specified columns, then add the resulting rows back into data set to make a composite object containing both original non-aggregate data rows as well as the aggregate rows.

First read comments for tabular.spreadsheet.aggregate().

This function returns a numpy ndarray, with the number of rows equaling:

len(Data) + len(A)

where A is the the result of:

Data.aggregate(On,AggFuncDict)

A represents the aggregate rows; the other rows were the original data rows.

This function supports _multiple_ aggregation, meaning that one can first aggregate on one set of factors, then repeat aggregation on the result for another set of factors, without the results of the first aggregation interfering the second. To achieve this, the method adds two new columns:

  • a column called “__aggregates__” specifying on which factors the rows that are aggregate rows were aggregated. Rows added by aggregating on factor A (a column in the original data set) will have A in the “__aggregates__” column. When multiple factors A1, A2 , ... are aggregated on, the notation is a comma-separated list: A1,A2,.... This way, when you call aggregate_in again, the function only aggregates on the columns that have the empty char ‘’ in their “__aggregates__” column.
  • a column called ‘__color__’, specifying Gray-Scale colors for aggregated rows that will be used by the Data Environment system browser for colorizing the data. When there are multiple levels of aggregation, the coarser aggregate groups (e.g. on fewer factors) get darker gray color then those on finer aggregate groups (e.g. more factors).

Implemented by the tabarray method tabular.tab.tabarray.aggregate_in().

Parameters

Data : numpy ndarray with structured dtype or recarray

The data set to aggregate in.

On : list of strings, optional

List of column names in X.

AggFuncDict : dictionary, optional

Dictionary where

  • keys are some (all) column names of X that are NOT in On
  • values are functions that can be applied to lists or numpy arrays.

This specifies how to aggregate the factors _not_ listed in On, e.g. the so-called Off columns.

AggFunc : function, optional

Function that can be applied to lists or numpy arrays, specifying how to aggregate factors not listed in either On or the keys of AggFuncDict, e.g. a “default” aggregation function for the Off columns not explicitly listed in AggFuncDict.

interspersed : boolean, optional

  • If True, aggregate rows are interleaved with the data of which they are aggregates.
  • If False, all aggregate rows placed at the end of the array.

Returns

agg : numpy ndarray with structured dtype

Composite aggregated data set plus original data set.

See also:

tabular.spreadsheet.pivot(X, a, b, Keep=None, NullVals=None, order=None, prefix='_')

Implements pivoting on numpy ndarrays (with structured dtype) or recarrays.

See http://en.wikipedia.org/wiki/Pivot_table for information about pivot tables.

Returns X pivoted on (a,b) with a as the row axis and b values as the column axis.

So-called “nontrivial columns relative to b” in X are added as color-grouped sets of columns, and “trivial columns relative to b” are also retained as cross-grouped sets of columns if they are listed in Keep argument.

Note that a column c in X is “trivial relative to b” if for all rows i, X[c][i] can be determined from X[b][i], e.g the elements in X[c] are in many-to-any correspondence with the values in X[b].

The function will raise an exception if the list of pairs of value in X[[a,b]] is not the product of the individual columns values, e.g.:

X[[a,b]] == set(X[a]) x set(X[b])

in some ordering.

Implemented by the tabarray method tabular.tab.tabarray.pivot()

Parameters

X : numpy ndarray with structured dtype or recarray

The data set to pivot.

a : string

Column name in X.

b : string

Another column name in X.

Keep : list of strings, optional

List of other columns names in X.

NullVals : optional

Dictionary mapping column names in X other than a or b to appropriate null values for their types.

If None, then the null values defined by the nullvalue function are used, see tabular.spreadsheet.nullvalue().

prefix : string, optional

Prefix to add to coloring keys corresponding to cross-grouped “trivial columns relative to b”. Default value is an underscore, ‘_’.

Returns

ptable : numpy ndarray with structured dtype

The resulting pivot table.

coloring : dictionary

Dictionary whose keys are strings and corresponding values are lists of column names (e.g. strings).

There are two groups of keys:

  • So-called “nontrivial columns relative to b” in X. These correspond to columns in:

    set(`X.dtype.names`) - set([a, b])
    
  • Cross-grouped “trivial columns relative to b”. The prefix is used to distinguish these.

The coloring parameter is used by the the tabarray pivot method, tabular.tab.tabarray.pivot().

See tabular.tab.tabarray.__new__() for more information about coloring.

tabular.spreadsheet.addrecords(X, new)

Append one or more records to the end of a numpy recarray or ndarray .

Can take a single record, void or tuple, or a list of records, voids or tuples.

Implemented by the tabarray method tabular.tab.tabarray.addrecords().

Parameters

X : numpy ndarray with structured dtype or recarray

The array to add records to.

new : record, void or tuple, or list of them

Record(s) to add to X.

Returns

out : numpy ndarray with structured dtype

New numpy array made up of X plus the new records.

See also: tabular.spreadsheet.rowstack()

tabular.spreadsheet.addcols(X, cols, names=None)

Add one or more columns to a numpy ndarray.

Technical dependency of tabular.spreadsheet.aggregate_in().

Implemented by the tabarray method tabular.tab.tabarray.addcols().

Parameters

X : numpy ndarray with structured dtype or recarray

The recarray to add columns to.

cols : numpy ndarray, or list of arrays of columns

Column(s) to add.

names: list of strings, optional

Names of the new columns. Only applicable when cols is a list of arrays.

Returns

out : numpy ndarray with structured dtype

New numpy array made up of X plus the new columns.

See also: tabular.spreadsheet.colstack()

tabular.spreadsheet.deletecols(X, cols)

Delete columns from a numpy ndarry or recarray.

Can take a string giving a column name or comma-separated list of column names, or a list of string column names.

Implemented by the tabarray method tabular.tab.tabarray.deletecols().

Parameters

X : numpy recarray or ndarray with structured dtype

The numpy array from which to delete columns.

cols : string or list of strings

Name or list of names of columns in X. This can be a string giving a column name or comma-separated list of column names, or a list of string column names.

Returns

out : numpy ndarray with structured dtype

New numpy ndarray with structured dtype given by X, excluding the columns named in cols.
tabular.spreadsheet.renamecol(X, old, new)

Rename column of a numpy ndarray with structured dtype, in-place.

Implemented by the tabarray method tabular.tab.tabarray.renamecol().

Parameters

X : numpy ndarray with structured dtype

The numpy array for which a column is to be renamed.

old : string

Old column name, e.g. a name in X.dtype.names.

new : string

New column name to replace old.
tabular.spreadsheet.replace(X, old, new, strict=True, cols=None, rows=None)

Replace value old with new everywhere it appears in-place.

Implemented by the tabarray method tabular.tab.tabarray.replace().

Parameters

X : numpy ndarray with structured dtype

Numpy array for which in-place replacement of old with new is to be done.

old : string

new : string

strict : boolean, optional

  • If strict = True, replace only exact occurences of old.
  • If strict = False, assume old and new are strings and replace all occurences of substrings (e.g. like str.replace())

cols : list of strings, optional

Names of columns to make replacements in; if None, make replacements everywhere.

rows : list of booleans or integers, optional

Rows to make replacements in; if None, make replacements everywhere.

Note: This function does in-place replacements. Thus there are issues handling data types here when replacement dtype is larger than original dtype. This can be resolved later by making a new array when necessary ...

tabular.spreadsheet.colstack(seq, mode='abort', returnnaming=False)

Horizontally stack a sequence of numpy ndarrays with structured dtypes

Analog of numpy.hstack for recarrays.

Implemented by the tabarray method tabular.tab.tabarray.colstack() which uses tabular.tabarray.tab_colstack().

Parameters

seq : sequence of numpy ndarray with structured dtype

List, tuple, etc. of numpy recarrays to stack vertically.

mode : string in [‘first’,’drop’,’abort’,’rename’]

Denotes how to proceed if when multiple recarrays share the same column name:

  • if mode == first, take the column from the first recarray in seq containing the shared column name.
  • elif mode == abort, raise an error when the recarrays to stack share column names; this is the default mode.
  • elif mode == drop, drop any column that shares its name with any other column among the sequence of recarrays.
  • elif mode == rename, for any set of all columns sharing the same name, rename all columns by appending an underscore, ‘_’, followed by an integer, starting with ‘0’ and incrementing by 1 for each subsequent column.

Returns

out : numpy ndarray with structured dtype

Result of horizontally stacking the arrays in seq.

See also: numpy.hstack.

tabular.spreadsheet.rowstack(seq, mode='nulls', nullvals=None)

Vertically stack a sequence of numpy ndarrays with structured dtype

Analog of numpy.vstack

Implemented by the tabarray method tabular.tab.tabarray.rowstack() which uses tabular.tabarray.tab_rowstack().

Parameters

seq : sequence of numpy recarrays

List, tuple, etc. of numpy recarrays to stack vertically.

mode : string in [‘nulls’, ‘commons’, ‘abort’]

Denotes how to proceed if the recarrays have different dtypes, e.g. different sets of named columns.

  • if mode == nulls, the resulting set of columns is determined by the union of the dtypes of all recarrays to be stacked, and missing data is filled with null values as defined by tabular.spreadsheet.nullvalue(); this is the default mode.
  • elif mode == commons, the resulting set of columns is determined by the intersection of the dtypes of all recarrays to be stacked, e.g. common columns.
  • elif mode == abort, raise an error when the recarrays to stack have different dtypes.

Returns

out : numpy ndarray with structured dtype

Result of vertically stacking the arrays in seq.

See also: numpy.vstack.

tabular.spreadsheet.join(L, keycols=None, nullvals=None, renamer=None, returnrenaming=False, Names=None)

Combine two or more numpy ndarray with structured dtype on common key column(s).

Merge a list (or dictionary) of numpy ndarray with structured dtype, given by L, on key columns listed in keycols.

This function is actually a wrapper for tabular.spreadsheet.strictjoin().

The strictjoin function has a few restrictions, and this join function will try to ensure that they are satisfied:

  • each element of keycol must be a valid column name in X and each array in L, and all of the same data-type.
  • for each column col in keycols, and each array A in L, the values in A[col] must be unique, e.g. no repeats of values – and same for X[col].
  • the non-key-column column names in each of the arrays must be disjoint from each other – or disjoint after a renaming (see below).

An error will be thrown if these conditions are not met.

If you don’t provide a value of keycols, the algorithm will attempt to infer which columns should be used by trying to find the largest set of common column names that contain unique values in each array and have the same data type. An error will be thrown if no such inference can be made.

Renaming of overlapping columns

If the non-keycol column names of the arrays overlap, join will by default attempt to rename the columns by using a simple convention:

  • If L is a list, it will append the number in the list to the key associated with the array.
  • If L is a dictionary, the algorithm will append the string representation of the key associated with an array to the overlapping columns from that array.

You can override the default renaming scheme using the renamer parameter.

Nullvalues for keycolumn differences

If there are regions of the keycolumns that are not overlapping between merged arrays, join will fill in the relevant entries with null values chosen by default:

  • ‘0’ for integer columns
  • ‘0.0’ for float columns
  • the empty character (‘’) for string columns.

Parameters

L : list or dictionary

Numpy recarrays to merge. If L is a dictionary, the keys name each numpy recarray, and the corresponding values are the actual numpy recarrays.

keycols : list of strings

List of the names of the key columns along which to do the merging.

nullvals : function, optional

A function that returns a null value for a numpy format descriptor string, e.g. '<i4' or '|S5'.

See the default function for further documentation:

tabular.spreadsheet.DEFAULT_NULLVALUEFORMAT()

renamer : function, optional

A function for renaming overlapping non-key column names among the numpy recarrays to merge.

See the default function for further documentation:

returnrenaming : Boolean, optional

Whether to return the result of the renamer function.

See the default function for further documentation:

Names: list of strings:

If L is a list, than names for elements of L can be specified with Names (without losing the ordering as you would if you did it with a dictionary).

len(L) must equal len(Names)

Returns

result : numpy ndarray with structured dtype

Result of the join, e.g. the result of merging the input numpy arrays defined in L on the key columns listed in keycols.

renaming : dictionary of dictionaries, optional

The result returned by the renamer function. Returned only if returnrenaming == True.

See the default function for further documentation:

See Also:

tabular.spreadsheet.strictjoin(L, keycols, nullvals=None, renaming=None, Names=None)

Combine two or more numpy ndarray with structured dtypes on common key column(s).

Merge a list (or dictionary) of numpy arrays, given by L, on key columns listed in keycols.

The strictjoin assumes the following restrictions:

  • each element of keycol must be a valid column name in X and each array in L, and all of the same data-type.
  • for each column col in keycols, and each array A in L, the values in A[col] must be unique, e.g. no repeats of values – and same for X[col].
  • the non-key-column column names in each of the arrays must be disjoint from each other – or disjoint after a renaming (see below).

An error will be thrown if these conditions are not met.

For a wrapper that attempts to meet these restrictions, see tabular.spreadsheet.join().

If you don’t provide a value of keycols, the algorithm will attempt to infer which columns should be used by trying to find the largest set of common column names that contain unique values in each array and have the same data type. An error will be thrown if no such inference can be made.

Renaming of overlapping columns

If the non-keycol column names of the arrays overlap, join will by default attempt to rename the columns by using a simple convention:

  • If L is a list, it will append the number in the list to the key associated with the array.
  • If L is a dictionary, the algorithm will append the string representation of the key associated with an array to the overlapping columns from that array.

You can override the default renaming scheme using the renamer parameter.

Nullvalues for keycolumn differences

If there are regions of the keycolumns that are not overlapping between merged arrays, join will fill in the relevant entries with null values chosen by default:

  • ‘0’ for integer columns
  • ‘0.0’ for float columns
  • the empty character (‘’) for string columns.

Parameters

L : list or dictionary

Numpy recarrays to merge. If L is a dictionary, the keys name each numpy recarray, and the corresponding values are the actual numpy recarrays.

keycols : list of strings

List of the names of the key columns along which to do the merging.

nullvals : function, optional

A function that returns a null value for a numpy format descriptor string, e.g. '<i4' or '|S5'.

See the default function for further documentation:

tabular.spreadsheet.DEFAULT_NULLVALUEFORMAT()

renaming : dictionary of dictionaries, optional

Dictionary mapping each input numpy recarray to a dictionary mapping each original column name to its new name following the convention above.

For example, the result returned by:

Returns

result : numpy ndarray with structured dtype

Result of the join, e.g. the result of merging the input numpy arrays defined in L on the key columns listed in keycols.

See Also:

tabular.spreadsheet.DEFAULT_RENAMER(L, Names=None)

Renames overlapping column names of numpy ndarrays with structured dtypes

Rename the columns by using a simple convention:

  • If L is a list, it will append the number in the list to the key associated with the array.
  • If L is a dictionary, the algorithm will append the string representation of the key associated with an array to the overlapping columns from that array.

Default renamer function used by tabular.spreadsheet.join()

Parameters

L : list or dictionary

Numpy recarrays with columns to be renamed.

Returns

D : dictionary of dictionaries

Dictionary mapping each input numpy recarray to a dictionary mapping each original column name to its new name following the convention above.

Previous topic

tabular.io

Next topic

tabular.fast

This Page