Title: | Simple Component Analysis |
---|---|
Description: | Simple Component Analysis (SCA) often provides much more interpretable components than Principal Components (PCA) while still representing much of the variability in the data. |
Authors: | Valentin Rousson <[email protected]> and Martin Maechler |
Maintainer: | Martin Maechler <[email protected]> |
License: | GPL (>= 2) |
Version: | 0.9-2 |
Built: | 2024-10-31 22:07:25 UTC |
Source: | https://github.com/cran/sca |
Agglomerate the two block-components which are closest according to the specified clustering method.
agglomblock(S, P, cluster = c("median","single","complete"))
S | correlation/covariance matrix |
P | component matrix |
cluster | character specifying the clustering method; default |
New component matrix with one block-component fewer.
sca, also for references
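A hedged usage sketch (starting from a matrix of singleton block-components is an illustrative assumption, mirroring the first, agglomerative stage of sca()):

```r
## start from one block-component per variable and merge the two
## closest block-components once (illustrative sketch):
library(sca)
data(hearlossC)
P0 <- diag(8)                        # 8 singleton block-components
P1 <- agglomblock(hearlossC, P0, cluster = "median")
ncol(P1)                             # one block-component less than P0
```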
Compute the simple component criterion for the components P on the correlation matrix S (cumulatively), using sccrit(). Function allcrit() computes even more criteria, some derived from sccrit().
allcrit(S, P, criterion, sortP = TRUE)
sccrit(S, P, criterion, sortP = TRUE)
S | correlation/covariance matrix |
P | component matrix |
criterion | character string specifying the optimality criterion to be used in |
sortP | logical indicating if |
sccrit() returns a numeric vector, the criterion computed (cumulatively).
allcrit() returns a list with components varpc, varsc, cumpc, cumsc, opt, corsc, and maxcor; see the description of the allcrit component in the return value of sca().
Valentin Rousson [email protected] and Martin Maechler [email protected].
quickcrit, sca, also for references.
covcomp returns the variance-covariance matrix of the components P on S, and corcomp returns the correlation matrix.
corcomp(S, P)
covcomp(S, P)
S | correlation/covariance matrix of the |
P | component matrix of dimension |
a square matrix.
Valentin Rousson [email protected] and Martin Maechler [email protected].
sca, also for references
data(USJudgeRatings)
S.jr <- cor(USJudgeRatings)
sca.jr <- sca(S.jr, b=4, inter=FALSE)
Vr <- covcomp(S.jr, P = sca.jr$simplemat)
Vr
Cr <- corcomp(S.jr, P = sca.jr$simplemat)
Cr
Return the first principal component of the residuals of S given the components P.
firstpcres(S, P)
S | correlation/covariance matrix |
P | component matrix |
numeric vector; actually, the first eigenvector of the matrix of residuals of S given the components P.
sca, also for references
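As a sketch of the underlying computation: assuming the residual matrix has the usual projection form S − S P (PᵀS P)⁻¹ PᵀS (an assumption; firstpcres() itself may normalize differently), the result can be mimicked as:

```r
library(sca)
data(hearlossC)
S <- hearlossC
P <- matrix(1, nrow = 8, ncol = 1)   # one block-component on all 8 variables
## residuals of S given P (assumed formula), then its first eigenvector:
Sres <- S - S %*% P %*% solve(t(P) %*% S %*% P) %*% t(P) %*% S
v1 <- eigen(Sres, symmetric = TRUE)$vectors[, 1]
if (v1[which(abs(v1) > 0)[1]] < 0) v1 <- -v1  # make first non-zero entry positive
v1
```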
The data consist of eight measurements of hearing loss taken on 100 males, aged 39, who had no indication of hearing difficulties. These measurements are decibel loss (in comparison to a reference standard) at frequencies 500Hz, 1000Hz, 2000Hz and 4000Hz for the left and the right ear, respectively.
data(hearlossC)
Eight variables, first the ones for “Left”, then for the “Right”. The frequencies are abbreviated, e.g., 2k for 2000 Hz or 5c for 500 Hz.
The variable names are (in this order) "Left5c", "Left1k", "Left2k", "Left4k", "Right5c", "Right1k", "Right2k", "Right4k".
This is the correlation matrix of data described in Chapter 5 of Jackson (1991).
Jackson, J.E. (1991) A User's Guide to Principal Components. John Wiley, New York.
data(hearlossC)
symnum(hearlossC)
sca(hearlossC) # -> explains 89.46% instead of 91.62%
Return the position and value of the largest element of a correlation matrix R, without taking into account the diagonal elements.
maxmatrix(R)
R | a square symmetric numeric matrix |
a list with components
row | row index of the maximum |
col | column index of the maximum |
val | value of the maximum, i.e. |
sca, also for references
data(reflexesC)
maxmatrix(reflexesC) # -> 0.98 at [1, 2]
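A minimal sketch of what maxmatrix() computes (the package's own code may differ, e.g. in how the two symmetric positions of the maximum are resolved):

```r
maxmat0 <- function(R) {
  R2 <- R
  diag(R2) <- -Inf                          # ignore diagonal elements
  ij <- which(R2 == max(R2), arr.ind = TRUE)[1, ]
  list(row = ij[[1]], col = ij[[2]], val = R[ij[[1]], ij[[2]]])
}
maxmat0(matrix(c(1, 0.9, 0.9, 1), 2, 2))    # val = 0.9
```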
Compute the next simple difference-component; this is an auxiliary function for sca.
nextdiff(S, P, withinblock, criterion)
S | correlation/covariance matrix |
P | component matrix |
withinblock |
logical indicating whether any given difference-component should only involve variables belonging to the same block-component. |
criterion | character string specifying the optimality criterion, see |
Uses firstpcres(S, P) and subsequently shrinkdiff(), the latter in a loop when withinblock is true.
In order to ensure uniqueness, the first non-zero entry of the principal component is always made positive.
a list with components
P | the new component matrix, i.e. the input |
nextpc | the next principal component with many entries set to 0. |
Valentin Rousson [email protected] and Martin Maechler [email protected].
shrinkdiff; sca, also for references
Returns strings of the same length as p, displaying the 100 * p percentages.
percent(p, d = 0, sep = " ")
p | number(s) in |
d | number of digits after the decimal point. |
sep | separator to use before the final |
character vector of the same length as p.
Martin Maechler
percent(0.25)
noquote(percent((1:10)/10))
(pc <- percent((1:10)/30, 1, sep=""))
noquote(pc)
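An equivalent-looking sketch of percent() (an assumption; the package's own implementation may differ in rounding details):

```r
percent0 <- function(p, d = 0, sep = " ")
  paste0(formatC(100 * p, format = "f", digits = d), sep, "%")
percent0(0.25)                   # "25 %"
percent0(1/30, d = 1, sep = "")  # "3.3%"
```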
This correlation matrix was published in Jeffers (1967) and was calculated from 180 observations. The 13 variables were used as explanatory variables in a regression problem which arose from a study on the strength of pitprops cut from home-grown timber.
data(pitpropC)
It is a correlation matrix of 13 variables which have the following meaning:
[,1] | TOPDIAM | Top diameter of the prop in inches |
[,2] | LENGTH | Length of the prop in inches |
[,3] | MOIST | Moisture content of the prop, expressed as a percentage of the dry weight |
[,4] | TESTSG | Specific gravity of the timber at the time of the test |
[,5] | OVENSG | Oven-dry specific gravity of the timber |
[,6] | RINGTOP | Number of annual rings at the top of the prop |
[,7] | RINGBUT | Number of annual rings at the base of the prop |
[,8] | BOWMAX | Maximum bow in inches |
[,9] | BOWDIST | Distance of the point of maximum bow from the top of the prop in inches |
[,10] | WHORLS | Number of knot whorls |
[,11] | CLEAR | Length of clear prop from the top of the prop in inches |
[,12] | KNOTS | Average number of knots per whorl |
[,13] | DIAKNOT | Average diameter of the knots in inches |
Jeffers (1967) replaced these 13 variables by their first six principal components. As noted by Vines (2000), this is an example where simple structure has proven difficult to detect in the past.
Jeffers, J.N.R. (1967) Two case studies in the application of principal components analysis. Appl. Statist. 16, 225–236.
Vines, S.K. (2000) Simple principal components. Appl. Statist. 49, 441–451.
data(pitpropC)
symnum(pitpropC)
Compute the additional contribution of a new component to the simple component system P on S.
quickcrit(newcomp, S, P, criterion)
newcomp | numeric vector, typically the result of |
S | correlation/covariance matrix |
P | component matrix |
criterion | character string specifying the optimality criterion, see |
...
sccrit; further sca, also for references.
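A hedged usage sketch (splitting off the last simple component of an sca() fit is purely illustrative):

```r
library(sca)
data(hearlossC)
r <- sca(hearlossC, interactive = FALSE)
q <- ncol(r$simplemat)
P   <- r$simplemat[, -q, drop = FALSE]  # existing system of components
new <- r$simplemat[, q]                 # candidate new component
## additional contribution of `new' to the system P on S:
quickcrit(new, hearlossC, P, criterion = "csv")
```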
This correlation matrix was published in Jolliffe (2002, p.58). The data consist of measurements of strength of reflexes at ten sites of the body, taken for 143 individuals. The variables come in five pairs, corresponding to right and left measurements on triceps, biceps, wrists, knees and ankles, respectively.
data(reflexesC)
It is a correlation matrix, i.e. symmetric, with diagonal 1.
The five pairs of variables are (in this order) "triceps.R", "triceps.L", "biceps.R", "biceps.L", "wrist.R", "wrist.L", "knee.R", "knee.L", "ankle.R", "ankle.L".
Jolliffe, I.T. (2002) Principal Component Analysis (2nd ed.). Springer, New York.
data(reflexesC)
symnum(reflexesC)
sca(reflexesC) # sca gets 97.95% of PCA
A system of simple components calculated from a correlation (or variance-covariance) matrix is built (interactively if interactive = TRUE) following the methodology of Rousson and Gasser (2003).
sca(S, b = if(interactive) 5, d = 0, qmin = if(interactive) 0 else 5,
    corblocks = if(interactive) 0 else 0.3, criterion = c("csv", "blp"),
    cluster = c("median", "single", "complete"), withinblock = TRUE,
    invertsigns = FALSE, interactive = dev.interactive())
## S3 method for class 'simpcomp'
print(x, ndec = 2, ...)
S |
the correlation (or variance-covariance) matrix to be analyzed. |
b |
the number of block-components initially proposed. |
d |
the number of difference-components initially proposed. |
qmin | if larger than zero, the number of difference-components is chosen such that the system contains at least |
corblocks | if larger than zero, the number of block-components is chosen such that correlations among them are all smaller than |
criterion | character string specifying the optimality criterion to be used for evaluating a system of simple components. One of |
cluster | character string specifying the clustering method to be used in the definition of the block-components. One of |
withinblock |
a logical indicating whether any given difference-component should only involve variables belonging to the same block-component. |
invertsigns |
a logical indicating whether the sign of some variables should be inverted initially in order to avoid negative correlations. |
interactive | a logical indicating whether the system of simple components should be built interactively. By default, |
x |
an object of class |
ndec |
number of decimals after the dot, for the percentages printed. |
... |
further arguments, passed to and from methods. |
When confronted with a large number of variables measuring different aspects of the same theme, the practitioner may like to summarize the information into a limited number of components. A component is a linear combination of the original variables, and the weights in this linear combination are called the loadings. Thus, a system of q components of p variables is defined by a p × q matrix of loadings.
Among all systems of components, principal components (PCs) are optimal in many ways. In particular, the first few PCs extract a maximum of the variability of the original variables and they are uncorrelated, such that the extracted information is organized in an optimal way: we may look at one PC after the other, separately, without taking into account the rest.
Unfortunately PCs are often difficult to interpret. The goal of Simple
Component Analysis is to replace (or to supplement) the optimal but
non-interpretable PCs by suboptimal but interpretable simple
components. The proposal of Rousson and Gasser (2003) is to look for
an optimal system of components, but only among the simple ones,
according to some definition of optimality and simplicity. The outcome
of their method is a simple matrix of loadings calculated from the
correlation matrix of the original variables.
Simplicity is not a guarantee of interpretability (but it helps in this regard). Thus, the user may wish to partly modify an optimal system of simple components in order to enhance interpretability. While PCs are by definition 100% optimal, the optimal system of simple components proposed by the procedure sca may be, say, 95% optimal, whereas the simple system altered by the user may be, say, 93% optimal. It is ultimately up to the user to decide whether the gain in interpretability is worth the loss of optimality.
The interactive procedure sca is intended to assist the user in his/her choice of an interpretable system of simple components. The algorithm consists of three distinct stages and proceeds in an interactive way. At each step of the procedure, a simple matrix of loadings is displayed in a window. The user may alter this matrix by clicking on its entries, following the instructions given there. If all the loadings of a component share the same sign, it is a “block-component”. If some loadings are positive and some are negative, it is a “difference-component”. Block-components are arguably easier to interpret than difference-components. Unfortunately, PCs almost always contain only one block-component. In the procedure sca, the user may choose the number of block-components in the system, the rationale being to have as many block-components as needed so that correlations among them are below some cut-off value (typically 0.3 or 0.4).
Simple block-components should define a partition of the original variables. This is done in the first stage of the procedure sca. An agglomerative hierarchical clustering procedure is used there.
The second stage of the procedure sca consists in the definition of simple difference-components. Those are obtained as simplified versions of some appropriate “residual components”. The idea is to retain the large loadings (in absolute value) of these residual components and to shrink the small ones to zero. For each difference-component, the interactive procedure sca displays the loadings of the corresponding residual component (at the right side of the window), such that the user may know which variables are especially important for the definition of this component.
At the third stage of the interactive procedure sca, it is possible to remove some of the difference-components from the system.
For many examples, it is possible to find a simple system which is 90% or 95% optimal, and where correlations between components are below 0.3 or 0.4. When the structure in the correlation matrix is complicated, it might be advantageous to invert the sign of some of the variables in order to avoid as much as possible negative correlations. This can be done using the option ‘invertsigns=TRUE’.
In principle, simple components can be calculated from a correlation matrix or from a variance-covariance matrix. However, the definition of simplicity used is not well adapted to the latter case, such that it will result in systems which are far from being 100% optimal. Thus, it is advised to define simple components from a correlation matrix, not from a variance-covariance matrix.
An object of class simpcomp which is basically a list with the following components:
simplemat |
an integer matrix defining a system of simple components. The rows correspond to variables and the columns correspond to components. |
loadings | loadings of simple components. This is a version of |
allcrit | a |
nblock | number of block-components in |
ndiff | number of difference-components in |
criterion | as above. |
cluster | as above. |
withinblock | as above. |
invertsigns | as above. |
vardata | the correlation (or variance-covariance) matrix which was analyzed. In principle it should be equal to argument |
PCA is already known to be “non-unique” in the sense that the principal directions (the eigenvectors, as computed e.g. by eigen()) are only determined up to a sign change, i.e., a factor of ±1. Consequently, results may change depending, e.g., only on the LAPACK / BLAS library used. This is even more the case for SCA, notably in artificial situations such as the ‘tests/artif3.R’ example in the sources of sca.
Valentin Rousson [email protected] and Martin Maechler [email protected].
Rousson, Valentin and Gasser, Theo (2004) Simple Component Analysis. JRSS: Series C (Applied Statistics) 53(4), 539–555; doi:10.1111/j.1467-9876.2004.05359.x
Rousson, V. and Gasser, Th. (2003) Some Case Studies of Simple Component Analysis. Manuscript, no longer available as ‘https://www.biostat.uzh.ch/research/manuscripts/scacases.pdf’
Gervini, D. and Rousson, V. (2003) Some Proposals for Evaluating Systems of Components in Dimension Reduction Problems. Submitted.
prcomp (for PCA), etc.
data(pitpropC)
sc.pitp <- sca(pitpropC, interactive=FALSE)
sc.pitp
## to see its low-level components:
str(sc.pitp)

## Let `X' be a matrix containing some data set whose rows correspond to
## subjects and whose columns correspond to variables. For example:
library(MASS)
SigU <- function(p, rho) { r <- diag(p); r[col(r) != row(r)] <- rho; r }
rmvN <- function(n, p, rho) mvrnorm(n, mu=rep(0,p), Sigma = SigU(p, rho))
X <- cbind(rmvN(100, 3, 0.7), rmvN(100, 2, 0.9), rmvN(100, 4, 0.8))

## An optimal simple system with at least 5 components for the data in `X',
## where the number of block-components is such that correlations among
## them are all smaller than 0.4, can be automatically obtained as:
(r <- sca(cor(X), qmin=5, corblocks=0.4, interactive=FALSE))

## On the other hand, an optimal simple system with two block-components
## and two difference-components for the data in `X' can be automatically
## obtained as:
(r <- sca(cor(X), b=2, d=2, qmin=0, corblocks=0, interactive=FALSE))

## The resulting simple matrix is contained in `r$simplemat'.
## A matrix of scores for such simple components can then be obtained as:
(Z <- scale(X) %*% r$loadings)

## On the other hand, scores of simple components calculated from the
## variance-covariance matrix of `X' can be obtained as:
r <- sca(var(X), b=2, d=2, qmin=0, corblocks=0, interactive=FALSE)
Z <- scale(X, scale=FALSE) %*% r$loadings

## One can also use the program interactively as follows:
if(interactive()) {
  r <- sca(cor(X), corblocks=0.4, qmin=5, interactive = TRUE)
  ## Since the interactive part of the program is active here, the proposed
  ## system can then be modified according to the user's wishes. The
  ## result of the procedure will be contained in `r'.
}
Shrinks a (principal) component towards a simple difference-component in sca.
shrinkdiff(zcomp, S, P, criterion)
zcomp | a component vector to be simplified. |
S | correlation/covariance matrix |
P | component matrix |
criterion | character string specifying the optimality criterion, see |
a list with the components
scompmax | a new simple component vector, typically the result of |
critmax | the (optimal) value of the criterion, achieved for |
sca, also for references
Simplifies the vector x to become a “simple” component vector (of the same size).
simpvector(x)
x | numeric vector of length |
a “simplified” version of x, i.e. an integer vector of the same length whose entries have the same signs.
sca, also for references
x0 <- c(-2:3, 3:-1, 0:3, 1, 1)
cbind(x0, simpvector(x0)) # entries (-11, 0, 3)
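One reading of the simplification rule, consistent with the example above (an assumption; the package's own rule may differ in details): positive entries become the number of negative entries, negative entries become minus the number of positive entries, so the result is a contrast summing to zero.

```r
simpv0 <- function(x) {
  np <- sum(x > 0)                 # number of positive entries
  nn <- sum(x < 0)                 # number of negative entries
  ifelse(x > 0, nn, ifelse(x < 0, -np, 0))
}
x0 <- c(-2:3, 3:-1, 0:3, 1, 1)
simpv0(x0)        # entries in {-11, 0, 3}
sum(simpv0(x0))   # 0: a contrast
```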
Reorder the columns of a component matrix P by decreasing variances of the components, where the block-components come first and the difference-components come afterwards.
sortmatrix(S, P)
S | correlation/covariance matrix |
P | component matrix |
numeric matrix which is just P with its columns reordered.
sca, also for references
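A short usage sketch (on the hearlossC correlation matrix documented above):

```r
library(sca)
data(hearlossC)
r <- sca(hearlossC, interactive = FALSE)
## columns re-sorted: block-components first, by decreasing variance:
sortmatrix(hearlossC, r$simplemat)
```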