formula.gam {mgcv} | R Documentation |
Description of gam formula (see Details), and how to extract it from a fitted gam object.
## S3 method for class 'gam'
formula(x,...)
x | fitted model objects of class gam |
... | un-used in this case |
gam will accept a formula or, with some families, a list of formulae. Other mgcv modelling functions will not accept a list. The list form provides a mechanism for specifying several linear predictors, and allows these to share terms: see below.
The formulae supplied to gam are exactly like those supplied to glm except that smooth terms, s, te, ti and t2, can be added to the right hand side (and the '.' shorthand is not supported in gam formulae).
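As a minimal sketch of the point above, the following fits a model whose formula has the same layout as a glm formula, plus one smooth term. The data frame dat and its columns x, z and y are simulated here purely for illustration and are not part of the original page.

```r
library(mgcv)

set.seed(1)
n <- 200
# Hypothetical simulated data: y is smooth in x and linear in z
dat <- data.frame(x = runif(n), z = runif(n))
dat$y <- sin(2 * pi * dat$x) + 0.5 * dat$z + rnorm(n, sd = 0.3)

# Same layout as a glm formula, with s(x) added on the right hand side
b <- gam(y ~ s(x) + z, data = dat)
summary(b)
```

The parametric term z gets a single coefficient, exactly as it would in glm, while s(x) contributes a penalized set of basis coefficients.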
Smooth terms are specified by expressions of the form:
s(x1,x2,...,k=12,fx=FALSE,bs="tp",by=z,id=1)
where x1, x2, etc. are the covariates which the smooth is a function of, and k is the dimension of the basis used to represent the smooth term. If k is not specified then basis specific defaults are used. Note that these defaults are essentially arbitrary, and it is important to check that they are not so small that they cause oversmoothing (too large just slows down computation). Sometimes the modelling context suggests sensible values for k, but if not, informal checking is easy: see choose.k and gam.check.
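The informal check described above might look like the sketch below. It uses k.check, the function in mgcv that computes the basis dimension table printed by gam.check, so that no diagnostic plots are produced. The data are simulated for illustration, and the choice of k = 4 versus k = 20 is an assumption made only to contrast a basis that is too small with one that is adequate.

```r
library(mgcv)

set.seed(2)
n <- 300
# Hypothetical data with three full sine cycles, so a small basis cannot follow it
dat <- data.frame(x = runif(n))
dat$y <- sin(6 * pi * dat$x) + rnorm(n, sd = 0.3)

b_small <- gam(y ~ s(x, k = 4), data = dat)   # basis deliberately too small
b_large <- gam(y ~ s(x, k = 20), data = dat)  # more generous basis

# One row per smooth; the first column is the maximum possible
# degrees of freedom for the term, the second its estimated edf
k.check(b_small)
kc <- k.check(b_large)
kc
```

An edf close to the value in the first column, together with a low p-value, suggests that k may be too small and is worth increasing.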
fx is used to indicate whether or not this term should be unpenalized, and therefore have a fixed number of degrees of freedom set by k (almost always k-1). bs indicates the basis to use for the smooth: the built in options are described in smooth.terms, and user defined smooths can be added (see user.defined.smooth). If bs is not supplied then the default "tp" (tprs) basis is used.
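A small sketch of fx and bs together, on simulated data (the data frame and the choice of a "cr" basis with k = 8 are illustrative assumptions): with fx=TRUE the term is an unpenalized regression spline, so its degrees of freedom are fixed at k-1 rather than estimated.

```r
library(mgcv)

set.seed(3)
n <- 200
dat <- data.frame(x = runif(n))  # hypothetical data
dat$y <- sin(2 * pi * dat$x) + rnorm(n, sd = 0.3)

# Cubic regression spline basis of dimension 8, with no penalty
b <- gam(y ~ s(x, bs = "cr", k = 8, fx = TRUE), data = dat)

# Intercept plus k - 1 = 7 unpenalized smooth coefficients
length(coef(b))
sum(b$edf)
```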
by can be used to specify a variable by which the smooth should be multiplied. For example gam(y~s(x,by=z)) would specify a model E(y)=f(x)z where f(.) is a smooth function. The by option is particularly useful for models in which different functions of the same variable are required for each level of a factor and for ‘varying coefficient models’: see gam.models.
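The two uses of by mentioned above can be sketched as follows. All variable names (x, z, fac, y, y2) belong to a simulated data frame created only for this illustration. Including fac as a parametric term in the factor case is a common choice, assumed here, because each factor-level smooth is centred.

```r
library(mgcv)

set.seed(4)
n <- 400
dat <- data.frame(x = runif(n), z = runif(n),
                  fac = factor(sample(c("a", "b"), n, replace = TRUE)))

# Factor by: a different smooth of x for each level of fac
dat$y <- ifelse(dat$fac == "a", sin(2 * pi * dat$x), dat$x^2) +
  rnorm(n, sd = 0.3)
b_fac <- gam(y ~ fac + s(x, by = fac), data = dat)

# Numeric by: varying coefficient model in which f(x) multiplies z
dat$y2 <- sin(2 * pi * dat$x) * dat$z + rnorm(n, sd = 0.3)
b_vc <- gam(y2 ~ s(x, by = z), data = dat)

length(b_fac$smooth)  # one smooth per factor level
length(b_vc$smooth)   # a single smooth, multiplied by z
```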
id is used to give smooths identities: smooths with the same identity have the same basis, penalty and smoothing parameter (but different coefficients, so they are different functions).
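To illustrate the effect of id, the sketch below fits the same factor-by model with and without a shared identity and compares the number of estimated smoothing parameters. The simulated data and the label id = 1 are assumptions for the example.

```r
library(mgcv)

set.seed(5)
n <- 400
dat <- data.frame(x = runif(n),
                  fac = factor(sample(c("a", "b"), n, replace = TRUE)))
dat$y <- ifelse(dat$fac == "a", sin(2 * pi * dat$x), cos(2 * pi * dat$x)) +
  rnorm(n, sd = 0.3)

# Separate smoothing parameter for each level of fac
b_free <- gam(y ~ fac + s(x, by = fac), data = dat)

# Same id: the two smooths share basis, penalty and smoothing parameter
b_linked <- gam(y ~ fac + s(x, by = fac, id = 1), data = dat)

length(b_free$sp)    # two estimated smoothing parameters
length(b_linked$sp)  # one shared smoothing parameter
```

The linked model still has two sets of coefficients, so the two fitted functions can differ in shape while being constrained to the same degree of smoothness penalty.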
An alternative for specifying smooths of more than one covariate is e.g.:
te(x,z,bs=c("tp","tp"),m=c(2,3),k=c(5,10))
which would specify a tensor product smooth of the two covariates x and z constructed from marginal t.p.r.s. bases of dimension 5 and 10 with marginal penalties of order 2 and 3. Any combination of basis types is possible, as is any number of covariates. te provides further information.
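The te term shown above can be tried directly on simulated data (the data frame below is hypothetical). The sketch also counts coefficients: the marginal bases of dimension 5 and 10 give 50 tensor product basis functions, one of which is absorbed by the centering constraint.

```r
library(mgcv)

set.seed(6)
n <- 300
dat <- data.frame(x = runif(n), z = runif(n))  # hypothetical data
dat$y <- sin(2 * pi * dat$x) * dat$z + rnorm(n, sd = 0.2)

# Tensor product of two t.p.r.s. margins: dimensions 5 and 10,
# marginal penalty orders 2 and 3
b <- gam(y ~ te(x, z, bs = c("tp", "tp"), m = c(2, 3), k = c(5, 10)),
         data = dat)

length(coef(b))  # intercept + (5 * 10 - 1) smooth coefficients
```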
ti terms are a variant designed to be used as interaction terms when the main effects (and any lower order interactions) are present. t2 produces tensor product smooths that are the natural low rank analogue of smoothing spline anova models.
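A sketch of the main effects plus interaction layout that ti is designed for, alongside a t2 smooth of the same covariates. The data are simulated for illustration, and default basis dimensions are used throughout.

```r
library(mgcv)

set.seed(7)
n <- 400
dat <- data.frame(x = runif(n), z = runif(n))  # hypothetical data
dat$y <- sin(2 * pi * dat$x) + dat$z^2 + 2 * dat$x * dat$z +
  rnorm(n, sd = 0.3)

# Main effects as ti(x) and ti(z), with ti(x, z) as the pure interaction
b_ti <- gam(y ~ ti(x) + ti(z) + ti(x, z), data = dat)

# Alternative tensor product construction for the full bivariate smooth
b_t2 <- gam(y ~ t2(x, z), data = dat)

length(b_ti$smooth)  # three separate terms
summary(b_ti)
```

With this decomposition the summary reports the interaction term separately, which makes it easier to judge whether the interaction is needed at all.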
s, te, ti and t2 terms accept an sp argument of supplied smoothing parameters: positive values are taken as fixed values to be used, negative to indicate that the parameter should be estimated. If sp is supplied then it over-rides whatever is in the sp argument to gam. If it is not supplied then it defaults to all negative, but does not over-ride the sp argument to gam.
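The effect of fixing sp inside a smooth term can be seen by comparing a very small and a very large fixed value with the default, in which the parameter is estimated. The data and the particular values 1e-4 and 1e4 are assumptions chosen only to make the contrast obvious.

```r
library(mgcv)

set.seed(8)
n <- 200
dat <- data.frame(x = runif(n))  # hypothetical data
dat$y <- sin(2 * pi * dat$x) + rnorm(n, sd = 0.3)

b_wiggly <- gam(y ~ s(x, sp = 1e-4), data = dat)  # fixed, light penalty
b_stiff  <- gam(y ~ s(x, sp = 1e4), data = dat)   # fixed, heavy penalty
b_auto   <- gam(y ~ s(x), data = dat)             # sp estimated

# Effective degrees of freedom fall as the fixed sp grows
c(wiggly = sum(b_wiggly$edf),
  stiff  = sum(b_stiff$edf),
  auto   = sum(b_auto$edf))
```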
Formulae can involve nested or “overlapping” terms such as y~s(x)+s(z)+s(x,z) or y~s(x,z)+s(z,v) but nested models should really be set up using ti terms: see gam.side for further details and examples.
Smooth terms in a gam formula will accept matrix arguments as covariates (and corresponding by variable), in which case a ‘summation convention’ is invoked. Consider the example of s(X,Z,by=L) where X, Z and L are n by m matrices. Let F be the n by m matrix that results from evaluating the smooth at the values in X and Z. Then the contribution to the linear predictor from the term will be rowSums(F*L) (note the element-wise multiplication). This convention allows the linear predictor of the GAM to depend on (a discrete approximation to) any linear functional of a smooth: see linear.functional.terms for more information and examples (including functional linear models/signal regression).
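The summation convention can be sketched with a single matrix covariate. Everything below is simulated for illustration: X holds m covariate values per observation, L holds equal weights, and the response is generated as rowSums(f(X)*L) plus noise, which is exactly the structure that s(X,by=L) represents. The data are passed as a list so that the matrices keep their shape.

```r
library(mgcv)

set.seed(9)
n <- 200
m <- 5
X <- matrix(runif(n * m), n, m)  # n by m matrix of covariate values
L <- matrix(1 / m, n, m)         # n by m matrix of weights
f <- function(x) sin(2 * pi * x) # hypothetical true smooth

# Linear predictor is rowSums(F * L), with F the smooth evaluated at X
y <- rowSums(f(X) * L) + rnorm(n, sd = 0.1)
dat <- list(y = y, X = X, L = L)

b <- gam(y ~ s(X, by = L), data = dat)
summary(b)
```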
Note that gam allows any term in the model formula to be penalized (possibly by multiple penalties), via the paraPen argument. See gam.models for details and example code.
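As a rough sketch of paraPen, the following applies a ridge penalty to the coefficients of a matrix term X. The data are simulated, and the identity penalty matrix is an illustrative choice; gam.models documents the full structure of the paraPen list, including the optional L, rank and sp elements.

```r
library(mgcv)

set.seed(10)
n <- 200
p <- 5
X <- matrix(rnorm(n * p), n, p)  # hypothetical design matrix
beta <- c(1, -1, 0.5, 0, 0)
y <- drop(X %*% beta) + rnorm(n)
dat <- list(y = y, X = X)

# paraPen is a list named by the parametric term to penalize; each
# element is itself a list holding one or more penalty matrices
b <- gam(y ~ X, data = dat, paraPen = list(X = list(diag(p))))

b$sp     # one estimated smoothing parameter for the ridge penalty
coef(b)
```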
When several formulae are provided in a list, then they can be used to specify multiple linear predictors for families for which this makes sense (e.g. mvn). The first formula in the list must include a response variable, but later formulae need not (depending on the requirements of the family). Let the linear predictors be indexed 1 to d, where d is the number of linear predictors, and the indexing is in the order in which the formulae appear in the list. It is possible to supply extra formulae specifying that several linear predictors should share some terms. To do this a formula is supplied in which the response is replaced by numbers specifying the indices of the linear predictors which will share the terms specified on the r.h.s. For example 1+3~s(x)+z-1 specifies that linear predictors 1 and 3 will share the terms s(x) and z (but we don't want an extra intercept, as this would usually be unidentifiable). Note that it is possible that a linear predictor only includes shared terms: it must still have its own formula, but the r.h.s. would simply be -1 (e.g. y ~ -1 or ~ -1).
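A sketch of the list form with a shared term, using the mvn family with two linear predictors. The data are simulated for illustration: both responses depend on the same smooth function of x2, which is specified once through the extra formula 1 + 2 ~ s(x2) - 1.

```r
library(mgcv)

set.seed(11)
n <- 300
dat <- data.frame(x0 = runif(n), x1 = runif(n), x2 = runif(n))

# Hypothetical bivariate response with correlated errors
shared <- sin(2 * pi * dat$x2)
e <- matrix(rnorm(2 * n), n, 2) %*% chol(matrix(c(1, 0.5, 0.5, 1), 2, 2))
dat$y0 <- 2 * dat$x0 + shared + e[, 1]
dat$y1 <- exp(dat$x1) + shared + e[, 2]

b <- gam(list(y0 ~ s(x0),           # linear predictor 1
              y1 ~ s(x1),           # linear predictor 2
              1 + 2 ~ s(x2) - 1),   # s(x2) shared by 1 and 2
         family = mvn(d = 2), data = dat)
summary(b)
```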
Returns the model formula, x$formula. Provided so that anova methods print an appropriate description of the model.
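For a fitted model, extracting the formula might look like this (simulated data, for illustration only):

```r
library(mgcv)

set.seed(12)
n <- 200
dat <- data.frame(x = runif(n))  # hypothetical data
dat$y <- sin(2 * pi * dat$x) + rnorm(n, sd = 0.3)

b <- gam(y ~ s(x), data = dat)

f <- formula(b)          # dispatches to the method for class 'gam'
f
identical(f, b$formula)  # the method returns x$formula
all.vars(f)              # variable names used in the formula
```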
A gam formula should not refer to variables using e.g. dat[["x"]].
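The pattern that avoids this problem is to use bare variable names in the formula and supply the data through the data argument, as in the sketch below (simulated data; the name newd is hypothetical). The same names can then be looked up in newdata when predicting.

```r
library(mgcv)

set.seed(13)
n <- 200
dat <- data.frame(x = runif(n))  # hypothetical data
dat$y <- sin(2 * pi * dat$x) + rnorm(n, sd = 0.3)

# Bare names in the formula, data supplied separately;
# not gam(dat[["y"]] ~ s(dat[["x"]]))
b <- gam(y ~ s(x), data = dat)

# Prediction finds x by name in the new data frame
newd <- data.frame(x = c(0.25, 0.5, 0.75))
p <- predict(b, newdata = newd)
p
```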
Simon N. Wood simon.wood@r-project.org