tabular.tab

Class and functions pertaining to the tabular.tab.tabarray class.

The tabarray class is a column-oriented hierarchical data object and subclass of numpy.ndarray.

The basic structure of this module is that it contains:

  • The tabarray class.
  • Some helper functions for tabarray. The helper functions are precisely those necessary to wrap functions from the tabular.spreadsheet module that operate on lists of arrays, so that they handle tabular’s additional structure. These functions are named with the convention “tab_FNAME”, e.g. “tab_rowstack”, “tab_join”, etc. The functions in tabular.spreadsheet that take only a single array are wrapped solely as methods of tabarray, and not as separate functions.

class tabular.tab.tabarray

Bases: numpy.ndarray

Subclass of the numpy ndarray with extra structure and functionality.

tabarray is a column-oriented data object based on the numpy ndarray with structured dtype, with added functionality and ability to define named groups of columns.

tabarray supports several i/o methods to/from a number of file formats, including (separated variable) text (e.g. .txt, .tsv, .csv), numpy binary (.npz) and hierarchical separated variable (.hsv).

Added functionality includes spreadsheet style operations such as “pivot”, “aggregate” and “replace”.

See the docstring of the tabarray.__new__ method, or the Tabular reference documentation, for information on constructing tabarrays.

static __new__(subtype, array=None, records=None, columns=None, SVfile=None, binary=None, HSVfile=None, HSVlist=None, shape=None, dtype=None, formats=None, names=None, titles=None, aligned=False, byteorder=None, buf=None, offset=0, strides=None, comments=None, delimiter=None, lineterminator='\n', escapechar=None, quoting=None, quotechar=None, doublequote=True, skipinitialspace=False, skiprows=0, uselines=None, usecols=None, excludecols=None, toload=None, metametadata=None, kvpairs=None, namesinheader=True, headerlines=None, valuefixer=None, linefixer=None, colfixer=None, delimiter_regex=None, coloring=None, inflines=2500, wrap=None, typer=None, missingvalues=None, fillingvalues=None, renamer=None, verbosity=5)

Unified constructor for tabarrays.

Specifying the data:

Data can be passed to the constructor, or loaded from several different file formats.

array : two-dimensional arrays (numpy.ndarray)

>>> import numpy
>>> x = numpy.array([[1, 2], [3, 4]])
>>> tabarray(array=x)
tabarray([(1, 2), (3, 4)], 
      dtype=[('f0', '<i4'), ('f1', '<i4')])

See also: numpy.rec.fromrecords

records : python list of records (elements can be tuples or lists)

>>> tabarray(records=[('bork', 1, 3.5), ('stork', 2, -4.0)], names=['x','y','z'])
tabarray([('bork', 1, 3.5), ('stork', 2, -4.0)], 
      dtype=[('x', '|S5'), ('y', '<i4'), ('z', '<f8')])

See also: numpy.rec.fromrecords

columns : list of python lists or 1-D numpy arrays

Fastest when passed a list of numpy arrays, rather than a list of lists.

>>> tabarray(columns=[['bork', 'stork'], [1, 2], [3.5, -4.0]], names=['x','y','z']) 
tabarray([('bork', 1, 3.5), ('stork', 2, -4.0)], 
      dtype=[('x', '|S5'), ('y', '<i4'), ('z', '<f8')])

kvpairs : list of list of key-value pairs

For loading key-value pairs (e.g. as from an XML file). Missing values can be specified using the fillingvalues argument.

See also: numpy.rec.fromarrays
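To illustrate the idea (a hypothetical sketch, not tabular’s implementation; columns_from_kvpairs is an invented helper), key-value pair records can be assembled into named columns, with a fill value standing in for missing keys, before the columns are handed to the constructor:

```python
# Hypothetical helper illustrating the kvpairs idea: collect the union of
# keys as column names, and pad records that lack a key with a fill value
# (the role played by the fillingvalues argument).
def columns_from_kvpairs(kvpairs, fill=''):
    names = []
    for record in kvpairs:
        for key, _ in record:
            if key not in names:
                names.append(key)
    columns = {name: [] for name in names}
    for record in kvpairs:
        lookup = dict(record)
        for name in names:
            columns[name].append(lookup.get(name, fill))
    return names, columns

names, columns = columns_from_kvpairs(
    [[('x', 'bork'), ('y', 1)], [('x', 'stork'), ('z', 3.5)]])
```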

SVfile : string

File path to a separated variable (CSV) text file. Load data from a CSV by calling:

tabular.io.loadSV(SVfile, comments, delimiter, 
lineterminator, skiprows, usecols, metametadata, 
namesinheader, valuefixer, linefixer)

See also: saveSV(), tabular.io.loadSV()

binary : string

File path to a binary file. Load a .npz binary file created by the savebinary() method by calling:

tabular.io.loadbinary(binary)

which uses numpy.load().

See also: savebinary(), tabular.io.loadbinary()

HSVfile : string

File path to a hierarchical separated variable (.hsv) directory, or a comma separated variable (.csv) text file inside of a HSV directory corresponding to a single column of data. Load a structured directory or single file defined by the saveHSV() method by calling:

tabular.io.loadHSV(HSVfile, toload)

See also: saveHSV(), tabular.io.loadHSV(), tabular.io.loadHSVlist()

HSVlist : list of strings

List of file paths to hierarchical separated variable (.hsv) files and/or individual comma separated variable (.csv) text files inside of HSV directories, all with the same number of records. Load a list of file paths created by the saveHSV() method by calling:

tabular.io.loadHSVlist(HSVlist)

See also: saveHSV(), tabular.io.loadHSV(), tabular.io.loadHSVlist()

Additional parameters:

names : list of strings

Sets the names of the columns of the resulting tabarray. If not specified, the names are determined first by looking for metadata in the header of the file; if none is found, they are assigned following NumPy’s f0, f1, ..., fn convention. See the namesinheader parameter below.

formats : string or list of strings

Sets the datatypes of the columns. The value of formats can be a list or comma-delimited string of type descriptions, one per column (e.g. “str,str,int,float” or [“str”, “str”, “int”, “float”]), a single value to apply to all columns, or anything that can be used in the numpy.rec.array constructor.

If neither the formats nor the dtype parameter is specified, typing is done by inference. (See also the typer parameter below.)
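For reference, the underlying NumPy interface that formats and names mirror can be exercised directly (a sketch using plain numpy.rec.fromrecords, not the tabarray constructor):

```python
import numpy

# Plain NumPy analogue of passing formats/names to the constructor:
# a comma-delimited formats string assigns a type to each column.
r = numpy.rec.fromrecords([('bork', 1, 3.5), ('stork', 2, -4.0)],
                          formats='S5,i4,f8', names='x,y,z')
```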

dtype : numpy dtype object

Sets the numpy dtype of the resulting tabarray, combining column format and column name information. If dtype is set, any names and formats specifications will be overridden. If neither the dtype nor the formats parameter is specified, typing is done by inference. (See also the typer parameter below.)

The names, formats and dtype parameters duplicate parameters of the NumPy record array creation interface. Additional parameters of the NumPy interface that are passed through are shape, titles, byteorder and aligned (see the NumPy documentation for more information).

delimiter : single-character string

When reading a text file, the character to use as delimiter to split fields. If not specified, the delimiter is determined first by looking for special-format metadata specifying the delimiter, and then, if no specification is found, attempts are made to infer the delimiter from file contents. (See the inflines parameter below.)

delimiter_regex : regular expression (compiled or in string format)

Regular expression to use to recognize delimiters, in place of a single character. (For instance, to have whitespace delimiting, use delimiter_regex = ‘[\s*]+’.)
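For intuition, the effect of a delimiter regular expression can be reproduced with the standard re module (an illustration only, not tabular’s parsing code):

```python
import re

# Splitting a raw line on runs of whitespace, as a delimiter_regex of
# '[\s*]+' would: each maximal whitespace run separates two fields.
line = 'bork  1\t3.5'
fields = re.split(r'[\s*]+', line)
```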

lineterminator : single-character string

Line terminator to use when reading in using SVfile.

skipinitialspace : boolean

If True, strips whitespace following the delimiter from each field.

The delimiter, lineterminator and skipinitialspace parameters are passed on as parameters to the Python csv module, which is used for reading in delimited text files. Additional parameters from that interface that are replicated in this constructor include quotechar, escapechar, quoting, doublequote and dialect (see the csv module documentation for more information).
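As a standalone illustration of the csv-module parameters being forwarded (this is plain csv usage, not tabular code):

```python
import csv
import io

# quotechar and doublequote let a field contain the delimiter itself;
# the reader below parses one line into the fields that SVfile loading
# would then hand on for typing.
buf = io.StringIO('bork,"a,b",3\n')
reader = csv.reader(buf, delimiter=',', quotechar='"', doublequote=True)
row = next(reader)
```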

skiprows : non-negative integer, optional

When reading from a text file, the first skiprows lines are ignored. Default is 0, i.e. no rows are skipped.

uselines : pair of non-negative integers, optional

When reading from a text file, the range of lines of data to load. (In contrast to skiprows, which specifies file rows to ignore before looking for header information, uselines specifies which data (non-header) lines to use, after the header has been stripped and processed.) See headerlines below.

usecols : sequence of non-negative integers or strings, optional

When reading from a text file, only the columns in usecols are loaded and processed. Columns can be described by number, with 0 being the first column; or, if name metadata is present, by name; or, if color group information is present in the file, by color group name. (Default is None, i.e. all columns are loaded.)

excludecols : sequence of non-negative integers or strings, optional

Converse of usecols, i.e. all columns EXCEPT those listed will be loaded.

comments : single-character string, optional

When reading from a text file, character used to distinguish header lines. If specified, any lines beginning with this character at the top of the file are assumed to contain header information and not row data.

headerlines : integer, optional

When reading from a text file, the number of lines at the top of the file (after the first skiprows lines) corresponding to the header of the file, where metadata can be found. Lines after headerlines are assumed to contain row contents. If not specified, value is determined first by looking for special metametadata in first line of file (see Tabular reference documentation for more information about this), and if no such metadata is found, is inferred by looking at file contents.

namesinheader : Boolean, optional

When reading from a text file, if namesinheader == True, then assume the column names are in the last header line (unless overridden by existing metadata or metametadata directive). Default is True.

linefixer : callable, optional

When reading from a text file, this callable is applied to every line in the file. This option is passed all the way down to the io.loadSVrecord function, and is applied directly to the strings in the file, after they’re split into lines but before they’re split into fields or any typing is done. The purpose is to make lines with errors or mistakes amenable to delimiter inference and field-splitting.

valuefixer : callable, or list or dictionary of callables, optional

When reading from a text file, these callable(s) are applied to every value in each field. The application is done after line strings are loaded and split into fields, but before any typing or missing-value imputation is done. The purpose of the valuefixer is to prepare column values for typing and imputation. The valuefixer callable can return a string or a python object. If valuefixer is a single callable, then that same callable is applied to values in all columns; if it is a dictionary, then the keys can be either numbers or names, and the value for a key will be applied to values in the corresponding column with that name or number; if it is a list, then the list elements must be in one-to-one correspondence with the loaded columns, and are applied to each respectively.
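The dictionary form described above can be sketched as follows (apply_valuefixer is a hypothetical helper for illustration, not tabular’s implementation):

```python
# Hypothetical sketch of the dictionary form of valuefixer: each callable
# is applied to the raw field strings of its named column, before typing;
# columns without an entry pass through unchanged.
def apply_valuefixer(rows, names, valuefixer):
    fixed = []
    for row in rows:
        fixed.append([valuefixer.get(name, lambda v: v)(value)
                      for name, value in zip(names, row)])
    return fixed

rows = apply_valuefixer([['  bork ', '1'], ['stork', '2']],
                        names=['x', 'y'],
                        valuefixer={'x': str.strip})
```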

colfixer : callable, or list or dictionary of callables, optional

Same as valuefixer, but instead of being applied to individual values, the callable(s) are applied to whole columns (and must return columns or numpy arrays of identical length). Like valuefixer, colfixer callable(s) are applied before typing and missing-value imputation.

missingvalues : string, callable returning string, or list or dictionary of strings or string-valued callable

When reading from a text file, the string value to consider as “missing data” and to be replaced before typing is done. If specified as a callable, the callable will be applied to the column(s) to determine the missing value. If specified as a dictionary, keys are expected to be numbers or names of columns, and values are the individual missing values for those columns (like the valuefixer interface).

fillingvalues : string, pair of strings, callable returning string, or list or dictionary of strings or string-valued callable

When reading from text file, values to be used to replace missing data before typing is done. If specified as a single non-callable, non-tuple value, this value is used to replace all missing data. If specified as a callable, the callable is applied to the column and returns the fill value (e.g. to allow the value to depend on the column type). If specified as a pair of values, the first value acts as the missing value and the second as the value to replace with. If a dictionary or list of values, then values are applied to corresponding columns.

NOTE: all of the missingvalues and fillingvalues functionality can be replicated (and generalized) using the valuefixer or colfixer parameters, by specifying function(s) which identify and replace missing values. While more limited, the missingvalues and fillingvalues interface is easier to use and gives better performance.
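As the note suggests, a (missing value, fill value) pair can be emulated with a valuefixer-style callable (a hypothetical sketch; make_missing_fixer is invented for illustration):

```python
# Invented helper: replicate fillingvalues=('NA', '0') with a callable
# that swaps the missing-data marker for the fill value before typing.
def make_missing_fixer(missing, fill):
    return lambda value: fill if value == missing else value

fix = make_missing_fixer('NA', '0')
column = [fix(value) for value in ['1', 'NA', '3']]
```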

typer : callable taking a python list of strings (or other values) and returning a 1-D numpy array; or list or dictionary of such callables

Function used to infer type and convert string lists into typed numpy arrays, if no format information has been provided. When applied at all, this function is applied after strings have been loaded and split into fields. This function is expected to impute missing values as well, and will override any setting of missingvalues or fillingvalues. If a callable is passed, it is used as the typer for all columns, while if a dictionary (or list) of callables is passed, they’re used on the corresponding columns. If needed (e.g. because formatting information hasn’t been supplied) but typer isn’t specified (at least, for a given column), the constructor defaults to using the utils.DEFAULT_TYPEINFERER function.
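A typer in this spirit might look like the following sketch (not the actual utils.DEFAULT_TYPEINFERER; the int-then-float fallback order is an assumption for illustration):

```python
import numpy

# Sketch of a typer callable: try integer typing, then float, and fall
# back to strings; always return a 1-D numpy array.
def simple_typer(values):
    for kind in (int, float):
        try:
            return numpy.array([kind(v) for v in values])
        except ValueError:
            continue
    return numpy.array(values)

col = simple_typer(['1', '2.5', '3'])
```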

inflines : integer, optional

Number of lines of file to use as sample data when inferring delimiter and header.

metametadata : dictionary of integers or pairs of integers

Specifies supplementary metametadata information for use with SVfile loading. See the Tabular reference documentation for more information.

coloring : dictionary

Hierarchical column-oriented structure.
  • Colorings can be passed as argument:
    • In the coloring argument, pass a dictionary. Each key is a string naming a color whose corresponding value is a list of column names (strings) in that color.
    • If colorings are passed as argument, they override any colorings inferred from the input data.
  • Colorings can be inferred from the input data:
    • If constructing from a .hsv directory, colorings will be automatically inferred from the directory tree.
    • If constructing from a CSV file (e.g. .tsv, .csv) created by saveSV(), colorings are automatically parsed from the header when present.
    • If constructing from a numpy binary file (e.g. .npz) created by savebinary(), colorings are automatically loaded from a binary file (coloring.npy) in the .npz directory.
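Concretely, a coloring is a plain dictionary; selecting a color name amounts to selecting its list of columns (a sketch of the lookup semantics only, using a dict of lists rather than a real tabarray):

```python
# A coloring maps group names to column names.  Selecting the 'position'
# color picks out exactly the columns listed under that key.
coloring = {'position': ['x', 'y'], 'info': ['name']}
columns = {'x': [1, 3], 'y': [2, 4], 'name': ['a', 'b']}
selected = {c: columns[c] for c in coloring['position']}
```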

wrap : string

Adds a color with name wrap listing all column names. (When this tabarray is saved to a .hsv directory, all columns will be nested in an additional directory, wrap.hsv.)

verbosity : integer, optional

Sets how much detail from messages will be printed.

Special column names:

Column names that begin and end with double underscores, e.g. ‘__column_name__’ are used to hold row-by-row metadata and specify arbitrary higher-level groups of rows, in analogy to how the coloring attribute specifies groupings of columns.

One use of this is for formatting and communicating “side” information to other tabarray methods. For instance:

  • A ‘__color__’ column is interpreted by the tabular.web.tabular2html function to specify row color in making html representations of tabarrays. It is expected in each row to contain a web-safe hex triplet color specification, e.g. a string of the form ‘#XXXXXX’ (see http://en.wikipedia.org/wiki/Web_colors).
  • The ‘__aggregates__’ column is used to disambiguate rows that are aggregates of data in other sets of rows for the .aggregate_in method (see comments on that method).

__array_finalize__(obj)

Set default attributes (e.g. coloring) if obj does not have them.

Note: this is called when you view a numpy ndarray as a tabarray.

__getitem__(ind)

Returns a subrectangle of the table.

The representation of the subrectangle depends on type(ind). Also, whether the returned object represents a new independent copy of the subrectangle, or a “view” into this self object, depends on type(ind).

  • If you pass the name of an existing coloring, you get a tabarray consisting of copies of columns in that coloring.
  • If you pass a list of existing coloring names and/or column names, you get a tabarray consisting of copies of columns in the list (name of coloring is equivalent to list of names of columns in that coloring; duplicate columns are deleted).
  • If you pass a numpy.ndarray, you get a tabarray consisting of a subrectangle of the tabarray, as handled by numpy.ndarray.__getitem__():
    • if you pass a 1D NumPy ndarray of booleans of len(self), the rectangle contains copies of the rows for which the corresponding entry is True.
    • if you pass a list of row numbers, you get a tabarray containing copies of these rows.
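The boolean-mask and row-number cases delegate to NumPy indexing; a plain structured ndarray shows the same semantics (an illustration, not tabarray itself):

```python
import numpy

# A boolean mask of len(self) keeps the True rows; a list of row numbers
# picks (copies of) those rows in the order given.
a = numpy.array([(1, 2.0), (3, 4.0)], dtype=[('x', 'i4'), ('y', 'f8')])
masked = a[numpy.array([True, False])]
listed = a[[1, 0]]
```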

__getslice__()

x.__getslice__(i, j) <==> x[i:j]

Use of negative indices is not supported.

copy()

Return a copy of the tabarray.

Note

This method is actually automatically inherited from the NumPy ndarray, but is explicitly included here to emphasize its utility. This documentation is modified from NumPy’s.

Notes

This is like:

>>> tb.tabarray(array=a, dtype=a.dtype, copy=True)

Examples

Create an array x, with a reference y and a copy z:

>>> x = tb.tabarray(records=[(1,2,3),(4,5,6)])
>>> y = x
>>> z = x.copy()

Note that, when we modify x, y changes, but not z:

>>> x[0] = (0,0,0)
>>> x[0] == y[0]
True
>>> x[0] == z[0]
False

tolist()

Return the array as a possibly nested list.

Return a copy of the array data as a (nested) Python list. Data items are converted to the nearest compatible Python type.

Note

This method is actually automatically inherited from the NumPy ndarray, but is explicitly included here to emphasize its utility. This documentation is modified from NumPy’s.

Returns

y : list

The possibly nested list of array elements.

Notes

The array may be recreated, a = tb.tabarray(records=a.tolist()).

Examples

>>> a = tb.tabarray(records=[('a', 2), ('c', 1)])
>>> list(a)
[('a', 2), ('c', 1)]
>>> type(list(a)[0])
<class 'numpy.core.records.record'>
>>> a.tolist()
[('a', 2), ('c', 1)]
>>> type(a.tolist()[0])
<type 'tuple'>

sort(kind='quicksort', order=None)

Sort an array, in-place.

Note

This method is actually automatically inherited from the NumPy ndarray, but is explicitly included here to emphasize its utility. This documentation is modified from NumPy’s.

Parameters

kind : {‘quicksort’, ‘mergesort’, ‘heapsort’}, optional

Sorting algorithm. Default is ‘quicksort’.

order : string or list, optional

This argument specifies which fields to compare first, second, and so on. This can be a string corresponding to a single column name, or a list of column names. This list does not need to include all of the column names.

See Also

numpy.sort : Return a sorted copy of an array. argsort : Indirect sort. lexsort : Indirect stable sort on multiple keys. searchsorted : Find elements in sorted array.

Notes

See numpy.sort for notes on the different sorting algorithms.

Examples

Use the order keyword to specify a column name (or list of columns) to use:

>>> a = tabarray(records=[('a', 2), ('c', 1)], names=['x', 'y'])
>>> a.sort(order='y')
>>> a
tabarray([('c', 1), ('a', 2)],
          dtype=[('x', '|S1'), ('y', '<i4')])

repeat(repeats)

Repeat elements of a tabarray.

Note

This method is actually automatically inherited from the NumPy ndarray, but is explicitly included here to emphasize its utility. This documentation is modified from NumPy’s.

Parameters

repeats : {int, array of ints}

The number of repetitions for each element. repeats is broadcasted to fit the number of records.

Returns

repeated_array : tabarray

Output array which has the same number of columns as the original tabarray.

See Also

numpy.repeat : function called by this method

Examples

>>> x = tb.tabarray(records=[(1,2),(3,4)])
>>> x.repeat(2)
tabarray([(1, 2), (1, 2), (3, 4), (3, 4)],
          dtype=[('f0', '<i4'), ('f1', '<i4')])
>>> x.repeat([1, 2])
tabarray([(1, 2), (3, 4), (3, 4)],
          dtype=[('f0', '<i4'), ('f1', '<i4')])

put(ind, v, mode='raise')

Replaces specified elements of the array with values taken from another array.

Note

This method is actually automatically inherited from the NumPy ndarray, but is explicitly included here to emphasize its utility. This documentation is modified from NumPy’s.

The indexing works on the flattened target array; put is roughly equivalent to:

for i, val in zip(ind, v):
    x.flat[i] = val

Parameters

ind : array_like

Target indices, interpreted as integers.

v : array_like

Values to place in the original array at target indices. If v is shorter than ind it will be repeated as necessary.

mode : {‘raise’, ‘wrap’, ‘clip’}, optional

Specifies how out-of-bounds indices will behave.

  • ‘raise’ – raise an error (default)
  • ‘wrap’ – wrap around
  • ‘clip’ – clip to the range

‘clip’ mode means that all indices that are too large are replaced by the index that addresses the last element along that axis. Note that this disables indexing with negative numbers.

See Also

putmask, place

Examples

>>> x = tb.tabarray(columns=[range(5)])
>>> y = tb.tabarray(columns=[range(10,15)])
>>> x.put([0, 2], y)
>>> x
tabarray([(10,), (1,), (11,), (3,), (4,)],
      dtype=[('f0', '<i4')])
>>> x = tb.tabarray(columns=[range(5)])
>>> y = tb.tabarray(columns=[range(10,15)])
>>> x.put(22, y, mode='clip')
>>> x
tabarray([(0,), (1,), (2,), (3,), (10,)],
          dtype=[('f0', '<i4')])

addcols(cols, names=None)

Add one or more new columns.

Method wraps:

tabular.spreadsheet.addcols(self, cols, names)

Documentation from tabular.spreadsheet.addcols():

Add one or more columns to a numpy ndarray.

Technical dependency of tabular.spreadsheet.aggregate_in().

Implemented by the tabarray method tabular.tab.tabarray.addcols().

Parameters

X : numpy ndarray with structured dtype or recarray

The recarray to add columns to.

cols : numpy ndarray, or list of arrays of columns

Column(s) to add.

names : list of strings, optional

Names of the new columns. Only applicable when cols is a list of arrays.

Returns

out : numpy ndarray with structured dtype

New numpy array made up of X plus the new columns.

See also: tabular.spreadsheet.colstack()

addrecords(new)

Append one or more records to the end of the array.

Method wraps:

tabular.spreadsheet.addrecords(self, new)

Documentation from tabular.spreadsheet.addrecords():

Append one or more records to the end of a numpy recarray or ndarray.

Can take a single record, void or tuple, or a list of records, voids or tuples.

Implemented by the tabarray method tabular.tab.tabarray.addrecords().

Parameters

X : numpy ndarray with structured dtype or recarray

The array to add records to.

new : record, void or tuple, or list of them

Record(s) to add to X.

Returns

out : numpy ndarray with structured dtype

New numpy array made up of X plus the new records.

See also: tabular.spreadsheet.rowstack()

aggregate(On=None, AggFuncDict=None, AggFunc=None, AggList=None, returnsort=False, KeepOthers=True)

Aggregate a tabarray on columns for given functions.

Method wraps:

tabular.spreadsheet.aggregate(self, On, AggFuncDict, AggFunc, returnsort)

Documentation from tabular.spreadsheet.aggregate():

Aggregate a ndarray with structured dtype (or recarray) on columns for given functions.

Aggregate a numpy recarray (or tabular tabarray) on a set of specified factors, using specified aggregation functions.

Intuitively, this function will aggregate the dataset X on a set of columns, whose names are listed in On, so that the resulting aggregate data set has one record for each unique tuple of values in those columns.

The more factors listed in the On argument, the “finer” the aggregation; the fewer factors, the “coarser” the aggregation. For example, if:

On = ['A','B']

the resulting data set will have one record for each unique value of pairs (a,b) in:

X[['A','B']]

The AggFuncDict argument specifies how to aggregate the factors _not_ listed in On, i.e. the so-called Off columns. For example, if

On = ['A','B']

and C is some other column, then:

AggFuncDict['C']

is the function that will be used to reduce to a single value the (potentially multiple) values in the C column corresponding to unique values in the A, B columns. For instance, if:

AggFuncDict['C'] = numpy.mean

then the result will be that the values in the C column corresponding to a single A, B value will be averaged.

If an Off column is _not_ provided as a key in AggFuncDict, a default aggregator function will be used: the sum function for numerical columns, concatenation for string columns.
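The semantics above can be sketched in plain Python (a toy model of the behavior, not tabular.spreadsheet.aggregate; rows here are tuples and the result a list of dictionaries):

```python
# Toy model of aggregation: group rows on the On columns, then reduce
# each Off column of a group with its AggFuncDict entry.
def aggregate_rows(rows, names, On, AggFuncDict):
    groups = {}
    for row in rows:
        record = dict(zip(names, row))
        key = tuple(record[c] for c in On)
        groups.setdefault(key, []).append(record)
    out = []
    for key, records in sorted(groups.items()):
        agg = dict(zip(On, key))
        for c in names:
            if c not in On:
                agg[c] = AggFuncDict[c]([r[c] for r in records])
        out.append(agg)
    return out

result = aggregate_rows([('a', 1), ('a', 2), ('b', 3)],
                        names=['A', 'C'], On=['A'],
                        AggFuncDict={'C': sum})
```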

Implemented by the tabarray method tabular.tab.tabarray.aggregate().

Parameters

X : numpy ndarray with structured dtype or recarray

The data set to aggregate.

On : list of strings, optional

List of column names in X.

AggFuncDict : dictionary, optional

Dictionary where

  • keys are some (all) column names of X that are NOT in On
  • values are functions that can be applied to lists or numpy arrays.

This specifies how to aggregate the factors _not_ listed in On, i.e. the so-called Off columns.

AggFunc : function, optional

Function that can be applied to lists or numpy arrays, specifying how to aggregate factors not listed in either On or the keys of AggFuncDict, e.g. a “default” aggregation function for the Off columns not explicitly listed in AggFuncDict.

returnsort : Boolean, optional

If returnsort == True, then return a list of indices describing how X was sorted as a result of aggregation. Default value is False.

Returns

agg : numpy ndarray with structured dtype

Aggregated data set.

index_array : numpy ndarray (int, 1D)

Returned only if returnsort == True. List of indices describing how X was sorted as a result of aggregation.


aggregate_in(On=None, AggFuncDict=None, AggFunc=None, AggList=None, interspersed=True)

Aggregate a tabarray and include original data in the result.

See the aggregate() method.

Method wraps:

tabular.summarize.aggregate_in(self, On, AggFuncDict, AggFunc, interspersed)

Documentation from tabular.spreadsheet.aggregate_in():

Aggregate a ndarray with structured dtype or recarray and include original data in the result.

Take aggregate of data set on specified columns, then add the resulting rows back into data set to make a composite object containing both original non-aggregate data rows as well as the aggregate rows.

First read comments for tabular.spreadsheet.aggregate().

This function returns a numpy ndarray, with the number of rows equaling:

len(Data) + len(A)

where A is the result of:

Data.aggregate(On,AggFuncDict)

A represents the aggregate rows; the other rows were the original data rows.

This function supports _multiple_ aggregation, meaning that one can first aggregate on one set of factors, then repeat aggregation on the result for another set of factors, without the results of the first aggregation interfering with the second. To achieve this, the method adds two new columns:

  • a column called “__aggregates__” specifying on which factors the rows that are aggregate rows were aggregated. Rows added by aggregating on factor A (a column in the original data set) will have A in the “__aggregates__” column. When multiple factors A1, A2, ... are aggregated on, the notation is a comma-separated list: A1,A2,.... This way, when you call aggregate_in again, the function only aggregates on the rows that have the empty string ‘’ in their “__aggregates__” column.
  • a column called ‘__color__’, specifying gray-scale colors for aggregated rows that will be used by the Data Environment system browser for colorizing the data. When there are multiple levels of aggregation, the coarser aggregate groups (i.e. those on fewer factors) get a darker gray color than the finer aggregate groups (i.e. those on more factors).

Implemented by the tabarray method tabular.tab.tabarray.aggregate_in().

Parameters

Data : numpy ndarray with structured dtype or recarray

The data set to aggregate in.

On : list of strings, optional

List of column names in X.

AggFuncDict : dictionary, optional

Dictionary where

  • keys are some (all) column names of X that are NOT in On
  • values are functions that can be applied to lists or numpy arrays.

This specifies how to aggregate the factors _not_ listed in On, i.e. the so-called Off columns.

AggFunc : function, optional

Function that can be applied to lists or numpy arrays, specifying how to aggregate factors not listed in either On or the keys of AggFuncDict, e.g. a “default” aggregation function for the Off columns not explicitly listed in AggFuncDict.

interspersed : boolean, optional

  • If True, aggregate rows are interleaved with the data of which they are aggregates.
  • If False, all aggregate rows are placed at the end of the array.

Returns

agg : numpy ndarray with structured dtype

Composite aggregated data set plus original data set.


appendHSV(fname, order=None)

Append the tabarray to an existing on-disk HSV representation.

Like saveHSV() but for appending instead of writing from scratch.

Method wraps:

tabular.io.appendHSV(fname, self, order)

Documentation from tabular.io.appendHSV():

Append records to an on-disk tabarray, e.g. HSV directory.

Function for appending records to an on-disk tabarray, used when one wants to write a large tabarray that is not going to be kept in memory at once.

If the tabarray is not there already, the function initializes the tabarray using the tabarray __new__ method, and saves it out.

Parameters

fname : string

Path of hierarchical separated variable (.hsv) file to which to append records in RecObj.

RecObj : array or dictionary

  • Either an array with complex dtype (e.g. tabarray, recarray or ndarray), or

  • a dictionary (ndarray with structured dtype, e.g. a tabarray) where

    • keys are names of columns to append to, and
    • the value on a column is a list of values to be appended to that column.

order : list of strings

List of column names specifying order in which the columns should be written; only used when the HSV does not exist and the header specifying order needs to be written.


appendcolumns(fname, order=None)

Append the tabarray to an existing on-disk flat HSV representation.

Like savecolumns() but for appending instead of writing from scratch.

Method wraps:

tabular.io.appendcolumns(fname, self, order)

Documentation from tabular.io.appendcolumns():

Append records to a flat on-disk tabarray, e.g. HSV without subdirectories.

Function for appending columns to a flat on-disk tabarray (i.e. one with no coloring), used when one wants to write a large tabarray that is not going to be kept in memory at once.

If the tabarray is not there already, the function initializes the tabarray using the tabarray __new__ method, and saves it out.

See tabular.io.appendHSV() for a more general method.

Parameters

fname : string

Path of hierarchical separated variable (.hsv) file to which to append.

RecObj : array or dictionary

  • Either an array with complex dtype (e.g. tabarray, recarray or ndarray), or

  • a dictionary (ndarray with structured dtype, e.g. a tabarray) where

    • keys are names of columns to append to, and
    • the value on a column is a list of values to be appended to that column.

order : list of strings

List of column names specifying order in which the columns should be written; only used when the HSV does not exist and the header specifying order needs to be written.


argsort(axis=-1, kind='quicksort', order=None)

Returns the indices that would sort an array.

Note

This method wraps numpy.argsort. This documentation is modified from that of numpy.argsort.

Perform an indirect sort along the given axis using the algorithm specified by the kind keyword. It returns an array of indices of the same shape as the original array that index data along the given axis in sorted order.

Parameters

axis : int or None, optional

Axis along which to sort. The default is -1 (the last axis). If None, the flattened array is used.

kind : {‘quicksort’, ‘mergesort’, ‘heapsort’}, optional

Sorting algorithm.

order : list, optional

This argument specifies which fields to compare first, second, etc. Not all fields need be specified.

Returns

index_array : ndarray, int

Array of indices that sort the tabarray along the specified axis. In other words, a[index_array] yields a sorted a.

See Also

sort : Describes sorting algorithms used.
lexsort : Indirect stable sort with multiple keys.
ndarray.sort : In-place sort.

Notes

See numpy.sort for notes on the different sorting algorithms.

Examples

Sorting with keys:

>>> x = tabarray([(1, 0), (0, 1)], dtype=[('x', '<i4'), ('y', '<i4')])
>>> x
tabarray([(1, 0), (0, 1)], 
      dtype=[('x', '<i4'), ('y', '<i4')])
>>> x.argsort(order=('x','y'))
array([1, 0])
>>> x.argsort(order=('y','x'))
array([0, 1])
colstack(new, mode='abort')

Horizontal stacking for tabarrays.

Stack tabarray(s) in new to the right of self.

See also

tabular.tabarray.tab_colstack(), tabular.spreadsheet.colstack()

Documentation from tabular.spreadsheet.colstack():

Horizontally stack a sequence of numpy ndarrays with structured dtypes.

Analog of numpy.hstack for recarrays.

Implemented by the tabarray method tabular.tab.tabarray.colstack() which uses tabular.tabarray.tab_colstack().

Parameters

seq : sequence of numpy ndarray with structured dtype

List, tuple, etc. of numpy recarrays to stack horizontally.

mode : string in [‘first’,’drop’,’abort’,’rename’]

Denotes how to proceed when multiple recarrays share the same column name:

  • if mode == first, take the column from the first recarray in seq containing the shared column name.
  • elif mode == abort, raise an error when the recarrays to stack share column names; this is the default mode.
  • elif mode == drop, drop any column that shares its name with any other column among the sequence of recarrays.
  • elif mode == rename, for any set of all columns sharing the same name, rename all columns by appending an underscore, ‘_’, followed by an integer, starting with ‘0’ and incrementing by 1 for each subsequent column.

Returns

out : numpy ndarray with structured dtype

Result of horizontally stacking the arrays in seq.

See also: numpy.hstack.
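The default 'abort' behavior can be sketched for plain structured arrays (a minimal illustration, not the library implementation, which also handles colorings and the other modes):

```python
import numpy as np

def colstack_abort(seq):
    # Sketch of mode='abort': fail on shared column names, otherwise
    # copy each input's columns side by side into one structured array.
    names = [n for a in seq for n in a.dtype.names]
    if len(set(names)) != len(names):
        raise ValueError('duplicate column names')
    dtype = [(n, a.dtype[n]) for a in seq for n in a.dtype.names]
    out = np.empty(len(seq[0]), dtype=dtype)
    for a in seq:
        for n in a.dtype.names:
            out[n] = a[n]
    return out
```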

deletecols(cols)

Delete columns and/or colors.

Method wraps:

tabular.spreadsheet.deletecols(self, cols)

Documentation from tabular.spreadsheet.deletecols():

Delete columns from a numpy ndarray or recarray.

Can take a string giving a column name or comma-separated list of column names, or a list of string column names.

Implemented by the tabarray method tabular.tab.tabarray.deletecols().

Parameters

X : numpy recarray or ndarray with structured dtype

The numpy array from which to delete columns.

cols : string or list of strings

Name or list of names of columns in X. This can be a string giving a column name or comma-separated list of column names, or a list of string column names.

Returns

out : numpy ndarray with structured dtype

New numpy ndarray with structured dtype given by X, excluding the columns named in cols.
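A minimal sketch of the column-deletion idea for plain structured arrays, accepting either input form (an illustration only; the library version also prunes coloring):

```python
import numpy as np

def deletecols(X, cols):
    # Sketch: accept 'a', 'a,b', or ['a', 'b'] and return a new array
    # containing only the remaining columns, in their original order.
    if isinstance(cols, str):
        cols = [c.strip() for c in cols.split(',')]
    keep = [n for n in X.dtype.names if n not in cols]
    return np.array([tuple(rec[n] for n in keep) for rec in X],
                    dtype=[(n, X.dtype[n]) for n in keep])
```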
extract()

Creates a copy of this tabarray in the form of a numpy ndarray.

Useful if you want to do math on array elements, e.g. if you have a subset of the columns that are all numerical, you can construct a numerical matrix and do matrix operations.
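For example, a set of numeric columns can be pulled out into a plain 2-D matrix for array math (a sketch with numpy and hypothetical column names; extract() itself returns an ndarray copy of the whole table):

```python
import numpy as np

# Hypothetical table with two numeric columns, 'a' and 'b'.
X = np.array([(1, 2.0), (3, 4.0)], dtype=[('a', int), ('b', float)])

# Build a plain 2-D matrix from the named columns, then do matrix math.
M = np.column_stack([X['a'], X['b']])
row_sums = M.sum(axis=1)
```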

join(ToMerge, keycols=None, nullvals=None, renamer=None, returnrenaming=False, selfname=None, Names=None)

Wrapper for spreadsheet.join, but handles coloring attributes.

The selfname argument allows naming of self to be used if ToMerge is a dictionary.

See also: tabular.spreadsheet.join(), tab_join()
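The key-matching idea behind a join can be sketched for two structured arrays sharing a key column (a toy illustration with hypothetical column names, not the library's join, which also handles colorings, null values and renaming):

```python
import numpy as np

A = np.array([(1, 10), (2, 20)], dtype=[('id', int), ('a', int)])
B = np.array([(2, 'v'), (1, 'u')], dtype=[('id', int), ('b', 'U1')])

# Index the right-hand table by key, then emit one joined record per
# left-hand row that has a match.
lookup = {int(rec['id']): str(rec['b']) for rec in B}
joined = np.array([(int(r['id']), int(r['a']), lookup[int(r['id'])])
                   for r in A if int(r['id']) in lookup],
                  dtype=[('id', int), ('a', int), ('b', 'U1')])
```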

pivot(a, b, Keep=None, NullVals=None, order=None, prefix='_')

Pivot with a as the row axis and b values as the column axis.

Method wraps:

tabular.spreadsheet.pivot(X, a, b, Keep)

Documentation from tabular.spreadsheet.pivot():

Implements pivoting on numpy ndarrays (with structured dtype) or recarrays.

See http://en.wikipedia.org/wiki/Pivot_table for information about pivot tables.

Returns X pivoted on (a,b) with a as the row axis and b values as the column axis.

So-called “nontrivial columns relative to b” in X are added as color-grouped sets of columns, and “trivial columns relative to b” are also retained as cross-grouped sets of columns if they are listed in Keep argument.

Note that a column c in X is “trivial relative to b” if for all rows i, X[c][i] can be determined from X[b][i], i.e. the elements in X[c] are in many-to-one correspondence with the values in X[b].

The function will raise an exception if the set of pairs of values in X[[a,b]] is not the product of the individual column values, e.g.:

X[[a,b]] == set(X[a]) x set(X[b])

in some ordering.

Implemented by the tabarray method tabular.tab.tabarray.pivot()

Parameters

X : numpy ndarray with structured dtype or recarray

The data set to pivot.

a : string

Column name in X.

b : string

Another column name in X.

Keep : list of strings, optional

List of other columns names in X.

NullVals : optional

Dictionary mapping column names in X other than a or b to appropriate null values for their types.

If None, then the null values defined by the nullvalue function are used, see tabular.spreadsheet.nullvalue().

prefix : string, optional

Prefix to add to coloring keys corresponding to cross-grouped “trivial columns relative to b”. Default value is an underscore, ‘_’.

Returns

ptable : numpy ndarray with structured dtype

The resulting pivot table.

coloring : dictionary

Dictionary whose keys are strings and corresponding values are lists of column names (e.g. strings).

There are two groups of keys:

  • So-called “nontrivial columns relative to b” in X. These correspond to columns in:

    set(`X.dtype.names`) - set([a, b])
    
  • Cross-grouped “trivial columns relative to b”. The prefix is used to distinguish these.

The coloring parameter is used by the tabarray pivot method, tabular.tab.tabarray.pivot().

See tabular.tab.tabarray.__new__() for more information about coloring.
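The core of the operation can be sketched on a small structured array (a toy illustration with hypothetical columns a, b, v; the real pivot also builds the output column names, the coloring, and the Keep handling):

```python
import numpy as np

X = np.array([('r1', 'c1', 1), ('r1', 'c2', 2),
              ('r2', 'c1', 3), ('r2', 'c2', 4)],
             dtype=[('a', 'U2'), ('b', 'U2'), ('v', int)])

# One output row per value of X['a']; one output column per value of X['b'].
avals = sorted(set(X['a']))
bvals = sorted(set(X['b']))
cell = {(r['a'], r['b']): int(r['v']) for r in X}
ptable = [[cell[(a, b)] for b in bvals] for a in avals]
```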

renamecol(old, new)

Rename column or color in-place.

Method wraps:

tabular.spreadsheet.renamecol(self, old, new)

Documentation from tabular.spreadsheet.renamecol():

Rename column of a numpy ndarray with structured dtype, in-place.

Implemented by the tabarray method tabular.tab.tabarray.renamecol().

Parameters

X : numpy ndarray with structured dtype

The numpy array for which a column is to be renamed.

old : string

Old column name, e.g. a name in X.dtype.names.

new : string

New column name to replace old.
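For a plain structured array the in-place rename amounts to reassigning the dtype's field-name tuple (a sketch of the idea; tabular's version also updates coloring keys):

```python
import numpy as np

def renamecol(X, old, new):
    # In-place: replace `old` with `new` in the dtype's field-name tuple.
    X.dtype.names = tuple(new if n == old else n for n in X.dtype.names)
```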
replace(old, new, strict=True, cols=None, rows=None)

Replace old with new in the rows rows of columns cols.

Method wraps:

tabular.spreadsheet.replace(self, old, new, strict, cols, rows)

Documentation from tabular.spreadsheet.replace():

Replace value old with new everywhere it appears in-place.

Implemented by the tabarray method tabular.tab.tabarray.replace().

Parameters

X : numpy ndarray with structured dtype

Numpy array for which in-place replacement of old with new is to be done.

old : string

new : string

strict : boolean, optional

  • If strict = True, replace only exact occurrences of old.
  • If strict = False, assume old and new are strings and replace all occurrences of substrings (e.g. like str.replace()).

cols : list of strings, optional

Names of columns to make replacements in; if None, make replacements everywhere.

rows : list of booleans or integers, optional

Rows to make replacements in; if None, make replacements everywhere.

Note: This function does in-place replacements. Thus there are issues handling data types here when replacement dtype is larger than original dtype. This can be resolved later by making a new array when necessary ...
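A minimal sketch of the strict, in-place case (an illustration only, assuming the replacement value fits each column's existing dtype, which is exactly the caveat noted above):

```python
import numpy as np

def replace_strict(X, old, new, cols=None):
    # In-place: overwrite exact occurrences of `old` with `new` in the
    # named columns (all columns if cols is None).
    for c in (cols if cols is not None else X.dtype.names):
        col = X[c]
        col[col == old] = new
```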

rowstack(new, mode='nulls')

Vertical stacking for tabarrays.

Stack tabarray(s) in new below self.

See also

tabular.tabarray.tab_rowstack(), tabular.spreadsheet.rowstack()

Documentation from tabular.spreadsheet.rowstack():

Vertically stack a sequence of numpy ndarrays with structured dtype.

Analog of numpy.vstack for recarrays.

Implemented by the tabarray method tabular.tab.tabarray.rowstack() which uses tabular.tabarray.tab_rowstack().

Parameters

seq : sequence of numpy recarrays

List, tuple, etc. of numpy recarrays to stack vertically.

mode : string in [‘nulls’, ‘commons’, ‘abort’]

Denotes how to proceed if the recarrays have different dtypes, e.g. different sets of named columns.

  • if mode == nulls, the resulting set of columns is determined by the union of the dtypes of all recarrays to be stacked, and missing data is filled with null values as defined by tabular.spreadsheet.nullvalue(); this is the default mode.
  • elif mode == commons, the resulting set of columns is determined by the intersection of the dtypes of all recarrays to be stacked, e.g. common columns.
  • elif mode == abort, raise an error when the recarrays to stack have different dtypes.

Returns

out : numpy ndarray with structured dtype

Result of vertically stacking the arrays in seq.

See also: numpy.vstack.
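The 'commons' mode can be sketched for plain structured arrays (a minimal version; the null-filling of mode='nulls' is omitted):

```python
import numpy as np

def rowstack_commons(seq):
    # Keep only the columns present in every input, in first-array order,
    # then concatenate the inputs row-wise.
    common = [n for n in seq[0].dtype.names
              if all(n in a.dtype.names for a in seq)]
    dtype = [(n, seq[0].dtype[n]) for n in common]
    out = np.empty(sum(len(a) for a in seq), dtype=dtype)
    start = 0
    for a in seq:
        for n in common:
            out[n][start:start + len(a)] = a[n]
        start += len(a)
    return out
```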

saveHSV(fname, printheaderfile=True)

Save the tabarray to a hierarchical separated variable (HSV) directory.

Save the tabarray to a .hsv directory. Each column is saved as a separate comma-separated variable file (.csv), whose name includes the column name and data type of the column (e.g. name.int.csv, name.float.csv, name.str.csv).

Hierarchical structure on the columns, i.e. coloring, is preserved by the file directory structure, with subdirectories named color.hsv and containing .csv files corresponding to columns of data grouped by that color.

Finally, rowdata is stored as a dump of a pickled object in the top level directory fname.

The .hsv can later be loaded back by passing the file path fname to the HSV argument of the tabarray constructor.

Method wraps:

tabular.io.saveHSV(fname, self, printheaderfile)

Documentation from tabular.io.saveHSV():

Save a tabarray to a hierarchical separated variable (HSV) directory.

The tabarray can later be loaded back from the .hsv by passing fname to the HSV argument of the tabarray constructor tabular.tab.tabarray.__new__().

This function is used by the tabarray method tabular.tab.tabarray.saveHSV().

Each column of data in the tabarray is stored inside of the .hsv directory to a separate comma-separated variable text file (.csv), whose name includes the column name and data type of the column (e.g. name.int.csv, name.float.csv, name.str.csv).

Coloring information, i.e. hierarchical structure on the columns, is stored in the file directory structure of the .hsv, where .hsv subdirectories correspond to colors in the coloring dictionary:

X.coloring.keys()

e.g. a subdirectory named color.hsv contains .csv files corresponding to columns of data grouped by that color:

X['color']

See tabular.tab.tabarray.__new__() for more information about coloring.

Note that when the file structure is not flat, tabular.io.loadHSV() calls itself recursively.

Parameters

fname : string

Path to a .hsv directory or individual .csv text files, corresponding to individual columns of data inside of a .hsv directory.

X : tabarray

The actual data in a tabular.tab.tabarray.

printheaderfile : boolean, optional

Whether or not to print an ordered list of columns names in an additional file header.txt in all .hsv directories. The order is given by:

X.dtype.names

The header.txt file is used by tabular.io.loadHSV() to load the columns of data in the proper order, but is not required.

saveSV(fname, comments=None, metadata=None, printmetadict=None, dialect=None, delimiter=None, doublequote=True, lineterminator='\n', escapechar=None, quoting=0, quotechar='"', skipinitialspace=False, stringifier=None, verbosity=5)

Save the tabarray to a single flat separated variable (CSV) text file.

Method wraps:

tabular.io.saveSV

See docstring of tabular.io.saveSV, or Tabular reference documentation, for more information.

Documentation from tabular.io.saveSV():

Save a tabarray to a separated-variable (CSV) file.

Parameters

fname : string

Path to a separated variable (CSV) text file.

X : tabarray

The actual data in a tabular.tab.tabarray.

comments : string, optional

The character to be used to denote the start of a header (non-data) line, e.g. ‘#’. If not specified, it is determined according to the following rule: ‘#’ if metadata argument is set, otherwise ‘’.

delimiter : string, optional

The character to be used to separate values in each line of text, e.g. ‘,’. If not specified, this is inferred from the file extension: if the file ends in .csv, the delimiter is ‘,’; otherwise it is ‘\t’.

linebreak : string, optional

The string separating lines of text. By default, this is assumed to be ‘\n’, and can also be set to ‘\r’ or ‘\r\n’.

metadata : list of strings or Boolean, optional

Allowed values are True, False, or any sublists of the list [‘names’, ‘formats’, ‘types’, ‘coloring’, ‘dialect’]. These keys indicate what special metadata is printed in the header.

  • If a sublist of [‘names’, ‘formats’, ‘types’, ‘coloring’, ‘dialect’], then the indicated types of metadata are written out.
  • If True, this is the same as metadata = [‘coloring’, ‘types’, ‘names’, ‘dialect’], e.g. as many types of metadata as this algorithm currently knows how to write out.
  • If False, no metadata is printed at all, e.g. just the data.
  • If metadata is not specified, the default is [‘names’], that is, just column names are written out.

printmetadict : Boolean, optional

Whether or not to print a string representation of the metadatadict in the first line of the header.

If printmetadict is not specified, then:

  • If metadata is specified and is not False, then printmetadict defaults to True.
  • Else if metadata is False, then printmetadict defaults to False.
  • Else metadata is not specified, and printmetadict defaults to False.

See the tabular.io.loadSV() for more information about metadatadict.

stringifier : callable taking a 1-d numpy array and returning a python list of strings of the same length, or a dictionary or tuple of such callables.

If specified, the callable will be applied to each column, and the resulting list of strings will be written to the file. If specified as a list or dictionary of callables, the functions will be applied to corresponding columns. The default used if stringifier is not specified is tb.utils.DEFAULT_STRINGIFIER, which merely passes through string-type columns, and converts numerical-type columns directly to corresponding strings with NaNs replaced with blank values. The main purpose of specifying a non-default value is to encode numerical values in various string encodings that might be required for other applications like databases.

NOTE: In certain special circumstances (e.g. when the lineterminator or delimiter character appears in a field of the data), the python CSV writer is used to write out data. To allow for control of the operation of the writer in these circumstances, the following other parameters replicating the interface of the CSV module are also valid, and values will be passed through: doublequote, escapechar, quoting, quotechar, and skipinitialspace. (See python CSV module documentation for more information.)
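The names-only-header case can be sketched with the standard csv module (an illustration of the output shape on a hypothetical two-column table, not the saveSV implementation):

```python
import csv
import io

import numpy as np

X = np.array([(1, 'a'), (2, 'b')], dtype=[('n', int), ('s', 'U1')])

# Write a names-only header line followed by one delimited line per record.
out = io.StringIO()
writer = csv.writer(out, delimiter=',', lineterminator='\n')
writer.writerow(X.dtype.names)
for rec in X:
    writer.writerow([rec[name] for name in X.dtype.names])
text = out.getvalue()
```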

savebinary(fname, savecoloring=True)

Save the tabarray to a numpy binary archive (.npz).

Save the tabarray to a .npz zipped file containing .npy binary files for data, plus optionally coloring and/or rowdata or simply to a .npy binary file containing the data but no coloring or rowdata.

Method wraps:

tabular.io.savebinary(fname, self, savecoloring, saverowdata)

Documentation from tabular.io.savebinary():

Save a tabarray to a numpy binary file or archive.

Save a tabarray to a numpy binary file (.npy) or archive (.npz) that can be loaded by tabular.io.loadbinary().

The .npz file is a zipped archive created using numpy.savez() and containing one or more .npy files, which are NumPy binary files created by numpy.save().

Parameters

fname : string or file-like object

File name or open numpy binary file (.npy) or archive (.npz) created by tabular.io.savebinary().

X : tabarray

The actual data in a tabular.tab.tabarray:

  • if fname is a .npy file, then this is the same as:

    numpy.save(fname, X)
    
  • otherwise, if fname is a .npz file, then X is zipped inside of fname as data.npy

savecoloring : boolean

Whether or not to save the coloring attribute of X. If savecoloring is True, then fname must be a .npz archive and X.coloring is zipped inside of fname as coloring.npy

See tabular.tab.tabarray.__new__() for more information about coloring.

See Also:

tabular.io.loadbinary(), numpy.load(), numpy.save(), numpy.savez()
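The .npz round trip can be sketched in-memory with numpy alone (a minimal illustration of saving a structured array under the key 'data'; coloring handling is omitted):

```python
import io

import numpy as np

X = np.array([(1, 2.0)], dtype=[('a', int), ('b', float)])

# Save the structured array under the key 'data', as the .npz form does,
# then load it back from the same buffer.
buf = io.BytesIO()
np.savez(buf, data=X)
buf.seek(0)
loaded = np.load(buf)['data']
```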
savecolumns(fname)

Save the tabarray to a set of flat .csv files, one per column.

Save the tabarray to a set of flat .csv files in .hsv format (e.g. .int.csv, .float.csv, .str.csv). Note that data in the coloring attribute is lost.

Method wraps:

tabular.io.savecolumns(fname, self)

Documentation from tabular.io.savecolumns():

Save columns of a tabarray to an existing HSV directory.

Save columns of tabarray X to an existing HSV directory fname (e.g. a .hsv directory created by tabular.io.saveHSV()).

Each column of data in the tabarray is stored inside of the .hsv directory to a separate comma-separated variable text file (.csv), whose name includes the column name and data type of the column (e.g. name.int.csv, name.float.csv, name.str.csv).

Coloring is lost.

This function is used by the tabarray method tabular.tab.tabarray.savecolumns().

Parameters

fname : string

Path to a hierarchical separated variable (HSV) directory (.hsv).

X : tabarray

The actual data in a tabular.tab.tabarray.
