Title: | 'eXtra' / 'eXperimental' Functionality for Robust Statistics |
---|---|
Description: | Robustness -- 'eXperimental', 'eXtraneous', or 'eXtraordinary' Functionality for Robust Statistics. Hence methods which are not well established, often related to methods in package 'robustbase'. Amazingly, 'BACON()', originally by Billor, Hadi, and Velleman (2000) <doi:10.1016/S0167-9473(99)00101-2> has become established in places. The "barrow wheel" `rbwheel()` is from Stahel and Mächler (2009) <doi:10.1111/j.1467-9868.2009.00706.x>. |
Authors: | Martin Maechler [aut, cre] , Werner A. Stahel [aut], Rolf Turner [ctb] (reclas()), Ueli Oetliker [ctb] (original version of BACON() and mvBACON for S+), Tobias Schoch [ctb] (init.sel="V2" for BACON; fix alpha) |
Maintainer: | Martin Maechler <[email protected]> |
License: | GPL (>= 2) |
Version: | 1.2-7 |
Built: | 2024-12-13 02:55:40 UTC |
Source: | https://github.com/cran/robustX |
The package robustX aims to be a collection of R functionality for robust statistics of methods and ideas that are considered as proposals, experimental, for experiences or just too much specialized to be part of the “Robust Basics” package robustbase.
Package: | robustX |
Type: | Package |
Title: | 'eXtra' / 'eXperimental' Functionality for Robust Statistics |
Version: | 1.2-7 |
Date: | 2023-06-14 |
Authors@R: | c(person("Martin","Maechler", role=c("aut","cre"), email="[email protected]", comment = c(ORCID = "0000-0002-8685-9910")) , person("Werner A.", "Stahel", role="aut", email="[email protected]") , person("Rolf", "Turner", role="ctb", email="[email protected]", comment = "reclas()") , person("Ueli", "Oetliker", role="ctb", comment = "original version of BACON() and mvBACON for S+") , person("Tobias", "Schoch", role="ctb", comment = "init.sel=\"V2\" for BACON; fix alpha") ) |
Maintainer: | Martin Maechler <[email protected]> |
Description: | Robustness -- 'eXperimental', 'eXtraneous', or 'eXtraordinary' Functionality for Robust Statistics. Hence methods which are not well established, often related to methods in package 'robustbase'. Amazingly, 'BACON()', originally by Billor, Hadi, and Velleman (2000) <doi:10.1016/S0167-9473(99)00101-2> has become established in places. The "barrow wheel" `rbwheel()` is from Stahel and Mächler (2009) <doi:10.1111/j.1467-9868.2009.00706.x>. |
Imports: | grDevices, graphics, stats, utils, robustbase (>= 0.92-3) |
Suggests: | MASS, lattice, pcaPP |
Enhances: | ICS |
License: | GPL (>= 2) |
Encoding: | UTF-8 |
NeedsCompilation: | no |
Packaged: | 2023-06-14 21:41:58 UTC; maechler |
Author: | Martin Maechler [aut, cre] (<https://orcid.org/0000-0002-8685-9910>), Werner A. Stahel [aut], Rolf Turner [ctb] (reclas()), Ueli Oetliker [ctb] (original version of BACON() and mvBACON for S+), Tobias Schoch [ctb] (init.sel="V2" for BACON; fix alpha) |
Date/Publication: | 2023-06-16 07:30:02 UTC |
Repository: | https://mmaechler.r-universe.dev |
RemoteUrl: | https://github.com/cran/robustX |
RemoteRef: | HEAD |
RemoteSha: | 17fd79cce47a9d40d3ea97a0e8c1294c861e07d9 |
Index of help topics:
BACON BACON for Regression or Multivariate Covariance Estimation L1median Compute the Multivariate L1-Median aka 'Spatial Median' Qrot Rotation Matrix to Specific Direction covNNC Robust Covariance Estimation via Nearest Neighbor Cleaning mvBACON BACON: Blocked Adaptive Computationally-Efficient Outlier Nominators rbwheel Multivariate Barrow Wheel Distribution Random Vectors reclas Recursive Robust Median-like Location and Scale robustX-package eXperimental eXtraneous ... Functionality for Robust Statistics
Werner Stahel, Martin Maechler and potentially others
Maintainer: Martin Maechler
Package robustbase which it complements and on which it depends; further package robust and the whole CRAN task view on robust statistics, https://cran.r-project.org/view=Robust
pairs( rbwheel(100, 4) )
pairs( rbwheel(100, 4) )
BACON, short for ‘Blocked Adaptive Computationally-Efficient Outlier Nominators’, is a somewhat robust algorithm (set), with an implementation for regression or multivariate covariance estimation.
BACON()
applies the multivariate (covariance estimation)
algorithm, using mvBACON(x)
in any case, and when
y
is not NULL
adds a regression iteration phase,
using the auxiliary .lmBACON()
function.
BACON(x, y = NULL, intercept = TRUE, m = min(collect * p, n * 0.5), init.sel = c("Mahalanobis", "dUniMedian", "random", "manual", "V2"), man.sel, init.fraction = 0, collect = 4, alpha = 0.05, alphaLM = alpha, maxsteps = 100, verbose = TRUE) ## *Auxiliary* function: .lmBACON(x, y, intercept = TRUE, init.dis, init.fraction = 0, collect = 4, alpha = 0.05, maxsteps = 100, verbose = TRUE)
BACON(x, y = NULL, intercept = TRUE, m = min(collect * p, n * 0.5), init.sel = c("Mahalanobis", "dUniMedian", "random", "manual", "V2"), man.sel, init.fraction = 0, collect = 4, alpha = 0.05, alphaLM = alpha, maxsteps = 100, verbose = TRUE) ## *Auxiliary* function: .lmBACON(x, y, intercept = TRUE, init.dis, init.fraction = 0, collect = 4, alpha = 0.05, maxsteps = 100, verbose = TRUE)
x |
a multivariate matrix of dimension [n x p] considered as containing no missing values. |
y |
the response (n vector) in the case of regression, or
|
intercept |
logical indicating if an intercept has to be used for the regression. |
m |
integer in |
init.sel |
character string, specifying the initial selection
mode; see |
man.sel |
only when |
init.dis |
the distances of the x matrix used for the initial
subset determined by |
init.fraction |
if this parameter is > 0 then the tedious steps of selecting the initial subset are skipped and an initial subset of size n * init.fraction is chosen (with smallest dis) |
collect |
numeric factor chosen by the user to define the size of the initial subset (p * collect) |
alpha |
number in |
alphaLM |
number in |
maxsteps |
the maximal number of iteration steps (to prevent infinite loops) |
verbose |
logical indicating if messages are printed which trace progress of the algorithm. |
Notably about the initial selection mode, init.sel
, see its
description in the mvBACON
arguments list.
The choice of alpha
and alphaLM
:
Multivariate outlier nomination: see the Details section of
mvBACON
.
Regression: Let denote the
quantile of the
Student
-distribution with
degrees of freedom,
where
is the number of elements in the current subset;
e.g.,
is the 0.95 quantile. Following Billor et
al. (2000), the cutoff value for the discrepancies is defined
as
, and
they use
. Note that this is argument
alphaLM
(defualting to alpha
) for BACON()
.
BACON(x,y,..)
(for regression) returns a list
with
components
subset |
the observation indices (in |
tis |
the |
mv.dis |
the (final) discrepancies or distances of
|
mv.subset |
the “good” subset from |
“BACON” was also chosen in honor of Francis Bacon:
Whoever knows the ways of Nature will more easily notice her deviations;
and, on the other hand, whoever knows her deviations will more accurately
describe her ways.
Francis Bacon (1620), Novum Organum II 29.
Ueli Oetliker, Swiss Federal Statistical Office, for S-plus 5.1; 25.05.2001; modified six times till 17.6.2001.
Port to R, testing etc, by Martin Maechler.
Daniel Weeks (at pitt.edu) proposed a fix to a long standing buglet in
GiveTis()
computing the , which was further improved
Maechler, for robustX version 1.2-3 (Feb. 2019).
Correction of alpha
default, from 0.95 to 0.05, by Tobias Schoch,
see mvBACON
.
Billor, N., Hadi, A. S., and Velleman , P. F. (2000). BACON: Blocked Adaptive Computationally-Efficient Outlier Nominators; Computational Statistics and Data Analysis 34, 279–298. doi:10.1016/S0167-9473(99)00101-2
mvBACON
, the multivariate version of the BACON
algorithm.
data(starsCYG, package = "robustbase") ## Plot simple data and fitted lines plot(starsCYG) lmST <- lm(log.light ~ log.Te, data = starsCYG) abline(lmST, col = "gray") # least squares line str(B.ST <- with(starsCYG, BACON(x = log.Te, y = log.light))) ## 'subset': A good set of of points (to determine regression): colB <- adjustcolor(2, 1/2) points(log.light ~ log.Te, data = starsCYG, subset = B.ST$subset, pch = 19, cex = 1.5, col = colB) ## A BACON-derived line: lmB <- lm(log.light ~ log.Te, data = starsCYG, subset = B.ST$subset) abline(lmB, col = colB, lwd = 2) require(robustbase) (RlmST <- lmrob(log.light ~ log.Te, data = starsCYG)) abline(RlmST, col = "blue")
data(starsCYG, package = "robustbase") ## Plot simple data and fitted lines plot(starsCYG) lmST <- lm(log.light ~ log.Te, data = starsCYG) abline(lmST, col = "gray") # least squares line str(B.ST <- with(starsCYG, BACON(x = log.Te, y = log.light))) ## 'subset': A good set of of points (to determine regression): colB <- adjustcolor(2, 1/2) points(log.light ~ log.Te, data = starsCYG, subset = B.ST$subset, pch = 19, cex = 1.5, col = colB) ## A BACON-derived line: lmB <- lm(log.light ~ log.Te, data = starsCYG, subset = B.ST$subset) abline(lmB, col = colB, lwd = 2) require(robustbase) (RlmST <- lmrob(log.light ~ log.Te, data = starsCYG)) abline(RlmST, col = "blue")
covNNC()
estimates robust covariance/dispersion matrices by the
nearest neighbor variance estimation (NNVE) or (rather)
“Nearest Neighbor Cleaning” (NNC) method of Wang and Raftery
(2002, JASA).
covNNC(X, k = min(12, n - 1), pnoise = 0.05, emconv = 0.001, bound = 1.5, extension = TRUE, devsm = 0.01)
covNNC(X, k = min(12, n - 1), pnoise = 0.05, emconv = 0.001, bound = 1.5, extension = TRUE, devsm = 0.01)
X |
matrix in which each row represents an observation or point and each column represents a variable. |
k |
desired number of nearest neighbors (default is 12) |
pnoise |
percent of added noise |
emconv |
convergence tolerance for EM |
bound |
value used to identify surges in variance caused by
outliers wrongly included as signal points ( |
extension |
whether or not to continue after reaching the last
chi-square distance. The default is to continue,
which is indicated by setting |
devsm |
when |
A list with components
cov |
covariance matrix |
mu |
mean vector |
postprob |
posterior probability |
classification |
classification (0=noise otherwise 1) obtained
by rounding |
innc |
list of initial nearest neighbor cleaning results (components are the covariance, mean, posterior probability and classification) |
Terms of use: GPL version 2 or newer.
MM: Even though covNNC()
is backed by a serious scientific
publication, I cannot recommend its use at all.
Naisyin Wang [email protected] and Adrian Raftery [email protected] with contributions from Chris Fraley [email protected].
covNNC()
, then named cov.nnve()
, used to be (the only
function) in CRAN package covRobust (2003), which was archived
in 2012.
Martin Maechler allowed ncol(X) == 1
,
sped up the original code, by reducing the amount of scaling;
further, the accuracy was increased (using internal q.dDk()
).
The original version is available, unexported as
robustX:::covNNC1
.
Wang, N. and Raftery, A. (2002) Nearest neighbor variance estimation (NNVE): Robust covariance estimation via nearest neighbor cleaning (with discussion). Journal of the American Statistical Association 97, 994–1019.
See also University of Washington Statistics Technical Report 368 (2000); see at https://stat.uw.edu/research/tech-reports/
cov.mcd
from package MASS;
covMcd
, and covOGK
from package robustbase.
The whole package rrcov.
data(iris) covNNC(iris[-5]) data(hbk, package="robustbase") hbk.x <- data.matrix(hbk[, 1:3]) covNNC(hbk.x)
data(iris) covNNC(iris[-5]) data(hbk, package="robustbase") hbk.x <- data.matrix(hbk[, 1:3]) covNNC(hbk.x)
Compute the multivariate -median
, also called
“Spatial Median”, i.e., the
minimizer of
where .
As a convex problem, there's always a global minimizer, computable not by a closed formula but rather an iterative search. As the (partial) first derivatives of the objective function is undefined at the data points, the minimization is not entirely trivial.
L1median(X, m.init = colMedians(X), weights = NULL, method = c("nlm", "HoCrJo", "VardiZhang", optimMethods, nlminbMethods), pscale = apply(abs(centr(X, m.init)), 2, mean, trim = 0.40), tol = 1e-08, maxit = 200, trace = FALSE, zero.tol = 1e-15, ...)
L1median(X, m.init = colMedians(X), weights = NULL, method = c("nlm", "HoCrJo", "VardiZhang", optimMethods, nlminbMethods), pscale = apply(abs(centr(X, m.init)), 2, mean, trim = 0.40), tol = 1e-08, maxit = 200, trace = FALSE, zero.tol = 1e-15, ...)
X |
numeric |
m.init |
starting value for |
weights |
optional numeric vector of non-negative weights;
currently only implemented for method |
method |
character string specifying the computational method, i.e., the algorithm to be used (can be abbreviated). |
pscale |
numeric p-vector of positive numbers,
the coordinate-wise scale (typical size of
|
tol |
positive number specifying the (relative) convergence tolerance. |
maxit |
positive integer specifying the maximal number of iterations (before the iterations are stopped prematurely if necessary). |
trace |
an integer specifying the tracing level of the
iterations; |
zero.tol |
for method |
... |
optional arguments to |
Currently, we have to refer to the “References” below.
currently the result depends strongly on the method
used.
FIXME. This will change considerably.
Martin Maechler. Method "HoCrJo"
is mostly based on Kristel
Joossens' R function, implementing Hossjer and Croux (1995).
Hossjer and Croux, C. (1995). Generalizing Univariate Signed Rank Statistics for Testing and Estimating a Multivariate Location Parameter. Non-parametric Statistics 4, 293–308.
Vardi, Y. and Zhang, C.-H. (2000).
The multivariate -median and associated data depth.
Proc. National Academy of Science 97(4), 1423–1426.
Fritz, H. and Filzmoser, P. and Croux, C. (2012) A comparison of algorithms for the multivariate L1-median. Computational Statistics 27, 393–410.
Kent, J. T., Er, F. and Constable, P. D. L. (2015) Algorithms for the spatial median;, in K. Nordhausen and S. Taskinen (eds), Modern Nonparametric, Robust and Multivariate Methods: Festschrift in Honour of Hannu Oja, Springer International Publishing, chapter 12, pp. 205–224. doi:10.1007/978-3-319-22404-6_12
CRAN package pcaPP added more L1 median methods,
re-implementing our R versions in C++, see Fritz et al.(2012) and
e.g., l1median_NLM()
.
data(stackloss) L1median(stackloss) L1median(stackloss, method = "HoCrJo") ## Explore all methods: m <- eval(formals(L1median)$method); allMeths <- m[m != "Brent"] L1m <- sapply(allMeths, function(meth) L1median(stackloss, method = meth)) ## --> with a warning for L-BFGS-B str(L1m) pm <- sapply(L1m, function(.) if(is.numeric(.)) . else .$par) t(pm) # SANN differs a bit; same objective ?
data(stackloss) L1median(stackloss) L1median(stackloss, method = "HoCrJo") ## Explore all methods: m <- eval(formals(L1median)$method); allMeths <- m[m != "Brent"] L1m <- sapply(allMeths, function(meth) L1median(stackloss, method = meth)) ## --> with a warning for L-BFGS-B str(L1m) pm <- sapply(L1m, function(.) if(is.numeric(.)) . else .$par) t(pm) # SANN differs a bit; same objective ?
This function performs an outlier identification algorithm to the data in the x array [n x p] and y vector [n] following the lines described by Hadi et al. for their BACON outlier procedure.
mvBACON(x, collect = 4, m = min(collect * p, n * 0.5), alpha = 0.05, init.sel = c("Mahalanobis", "dUniMedian", "random", "manual", "V2"), man.sel, maxsteps = 100, allowSingular = FALSE, verbose = TRUE)
mvBACON(x, collect = 4, m = min(collect * p, n * 0.5), alpha = 0.05, init.sel = c("Mahalanobis", "dUniMedian", "random", "manual", "V2"), man.sel, maxsteps = 100, allowSingular = FALSE, verbose = TRUE)
x |
numeric matrix (of dimension |
collect |
a multiplication factor |
m |
integer in |
alpha |
determines the cutoff value for the Mahalanobis distances (see details). |
init.sel |
character string, specifying the initial selection mode; implemented modes are:
|
man.sel |
only when |
maxsteps |
maximal number of iteration steps. |
allowSingular |
logical indicating a solution should be sought
also when no matrix of rank |
verbose |
logical indicating if messages are printed which trace progress of the algorithm. |
Remarks on the tuning parameter alpha
: Let
be a chi-square distributed random variable with
degrees
of freedom (
is the number of variables;
is the
number of observations). Denote the
quantile by
, e.g.,
is the 0.95 quantile.
Following Billor et al. (2000), the cutoff value for the
Mahalanobis distances is defined as
(the square
root of
) times a correction factor
,
and
,
and they use
.
a list
with components
subset |
logical vector of length |
dis |
numeric vector of length |
cov |
|
Ueli Oetliker, Swiss Federal Statistical Office, for S-plus 5.1.
Port to R, testing etc, by Martin Maechler;
Init selection "V2"
and correction of default alpha
from 0.95 to 0.05,
by Tobias Schoch, FHNW Olten, Switzerland.
Billor, N., Hadi, A. S., and Velleman , P. F. (2000). BACON: Blocked Adaptive Computationally-Efficient Outlier Nominators; Computational Statistics and Data Analysis 34, 279–298. doi:10.1016/S0167-9473(99)00101-2
covMcd
for a high-breakdown (but more computer
intensive) method;
BACON
for a “generalization”, notably to
regression.
require(robustbase) # for example data and covMcd(): ## simple 2D example : plot(starsCYG, main = "starsCYG data (n=47)") B.st <- mvBACON(starsCYG) points(starsCYG[ ! B.st$subset,], pch = 4, col = 2, cex = 1.5) stopifnot(identical(which(!B.st$subset), c(7L,11L,20L,30L,34L))) ## finds the 4 clear outliers (and 1 "borderline"); ## it does not find obs. 14 which is an outlier according to covMcd(.) iniS <- setNames(, eval(formals(mvBACON)$init.sel)) # all initialization methods, incl "random" set.seed(123) Bs.st <- lapply(iniS[iniS != "manual"], function(s) mvBACON(as.matrix(starsCYG), init.sel = s, verbose=FALSE)) ii <- - match("steps", names(Bs.st[[1]])) Bs.s1 <- lapply(Bs.st, `[`, ii) stopifnot(exprs = { length(Bs.s1) >= 4 length(unique(Bs.s1)) == 1 # all 4 methods give the same }) ## Example where "dUniMedian" and "V2" differ : data(pulpfiber, package="robustbase") dU.plp <- mvBACON(as.matrix(pulpfiber), init.sel = "dUniMedian") V2.plp <- mvBACON(as.matrix(pulpfiber), init.sel = "V2") (oU <- which(! dU.plp$subset)) (o2 <- which(! V2.plp$subset)) stopifnot(setdiff(o2, oU) %in% c(57L,58L,59L,62L)) ## and 57, 58, 59, and 62 *are* outliers according to covMcd(.) ## 'coleman' from pkg 'robustbase' coleman.x <- data.matrix(coleman[, 1:6]) Cc <- covMcd (coleman.x) # truly robust summary(Cc) # -> 6 outliers (1,3,10,12,17,18) Cb1 <- mvBACON(coleman.x) ##-> subset is all TRUE hmm?? Cb2 <- mvBACON(coleman.x, init.sel = "dUniMedian") stopifnot(all.equal(Cb1, Cb2)) ## try 20 different random starts: Cb.r <- lapply(1:20, function(i) { set.seed(i) mvBACON(coleman.x, init.sel="random", verbose=FALSE) }) nm <- names(Cb.r[[1]]); nm <- nm[nm != "steps"] all(eqC <- sapply(Cb.r[-1], function(CC) all.equal(CC[nm], Cb.r[[1]][nm]))) # TRUE ## --> BACON always breaks down, i.e., does not see the outliers here ## breaks down even when manually starting with all the non-outliers: Cb.man <- mvBACON(coleman.x, init.sel = "manual", man.sel = setdiff(1:20, c(1,3,10,12,17,18))) which( ! Cb.man$subset) # the outliers according to mvBACON : _none_
require(robustbase) # for example data and covMcd(): ## simple 2D example : plot(starsCYG, main = "starsCYG data (n=47)") B.st <- mvBACON(starsCYG) points(starsCYG[ ! B.st$subset,], pch = 4, col = 2, cex = 1.5) stopifnot(identical(which(!B.st$subset), c(7L,11L,20L,30L,34L))) ## finds the 4 clear outliers (and 1 "borderline"); ## it does not find obs. 14 which is an outlier according to covMcd(.) iniS <- setNames(, eval(formals(mvBACON)$init.sel)) # all initialization methods, incl "random" set.seed(123) Bs.st <- lapply(iniS[iniS != "manual"], function(s) mvBACON(as.matrix(starsCYG), init.sel = s, verbose=FALSE)) ii <- - match("steps", names(Bs.st[[1]])) Bs.s1 <- lapply(Bs.st, `[`, ii) stopifnot(exprs = { length(Bs.s1) >= 4 length(unique(Bs.s1)) == 1 # all 4 methods give the same }) ## Example where "dUniMedian" and "V2" differ : data(pulpfiber, package="robustbase") dU.plp <- mvBACON(as.matrix(pulpfiber), init.sel = "dUniMedian") V2.plp <- mvBACON(as.matrix(pulpfiber), init.sel = "V2") (oU <- which(! dU.plp$subset)) (o2 <- which(! V2.plp$subset)) stopifnot(setdiff(o2, oU) %in% c(57L,58L,59L,62L)) ## and 57, 58, 59, and 62 *are* outliers according to covMcd(.) ## 'coleman' from pkg 'robustbase' coleman.x <- data.matrix(coleman[, 1:6]) Cc <- covMcd (coleman.x) # truly robust summary(Cc) # -> 6 outliers (1,3,10,12,17,18) Cb1 <- mvBACON(coleman.x) ##-> subset is all TRUE hmm?? Cb2 <- mvBACON(coleman.x, init.sel = "dUniMedian") stopifnot(all.equal(Cb1, Cb2)) ## try 20 different random starts: Cb.r <- lapply(1:20, function(i) { set.seed(i) mvBACON(coleman.x, init.sel="random", verbose=FALSE) }) nm <- names(Cb.r[[1]]); nm <- nm[nm != "steps"] all(eqC <- sapply(Cb.r[-1], function(CC) all.equal(CC[nm], Cb.r[[1]][nm]))) # TRUE ## --> BACON always breaks down, i.e., does not see the outliers here ## breaks down even when manually starting with all the non-outliers: Cb.man <- mvBACON(coleman.x, init.sel = "manual", man.sel = setdiff(1:20, c(1,3,10,12,17,18))) which( ! Cb.man$subset) # the outliers according to mvBACON : _none_
Construct the rotation matrix that rotates the
unit vector (1,0,....0), i.e., the
-axis,
onto (1,1,1,...1)/
, or more generally to
(
unit.image
).
Qrot(p, transpose = FALSE, unit.image = rep(1, p))
Qrot(p, transpose = FALSE, unit.image = rep(1, p))
p |
integer; the dimension (of the vectors involved). |
transpose |
logical indicating if the transposed matrix is to returned. |
unit.image |
numeric vector of length |
The qr
decomposition is used for a Gram-Schmitt basis
orthogonalization.
orthogonal matrix which rotates
onto a vector proportional to
unit.image
.
Martin Maechler
qr
, matrix (and vector) multiplication,
%*%
.
Q <- Qrot(6) zapsmall(crossprod(Q)) # 6 x 6 unity <==> Q'Q = I <==> Q orthogonal if(require("MASS")) { Qt <- Qrot(6, transpose = TRUE) stopifnot(all.equal(Qt, t(Q))) fractions(Qt ^2) # --> 1/6 1/30 etc, in an almost lower-triagonal matrix }
Q <- Qrot(6) zapsmall(crossprod(Q)) # 6 x 6 unity <==> Q'Q = I <==> Q orthogonal if(require("MASS")) { Qt <- Qrot(6, transpose = TRUE) stopifnot(all.equal(Qt, t(Q))) fractions(Qt ^2) # --> 1/6 1/30 etc, in an almost lower-triagonal matrix }
Generate -dimensional random vectors according to Stahel's
Barrow Wheel Distribution.
rbwheel(n, p, frac = 1/p, sig1 = 0.05, sig2 = 1/10, rGood = rnorm, rOut = function(n) sqrt(rchisq(n, p - 1)) * sign(runif(n, -1, 1)), U1 = rep(1, p), scaleAfter = TRUE, scaleBefore = FALSE, spherize = FALSE, fullResult = FALSE)
rbwheel(n, p, frac = 1/p, sig1 = 0.05, sig2 = 1/10, rGood = rnorm, rOut = function(n) sqrt(rchisq(n, p - 1)) * sign(runif(n, -1, 1)), U1 = rep(1, p), scaleAfter = TRUE, scaleBefore = FALSE, spherize = FALSE, fullResult = FALSE)
n |
integer, specifying the sample size. |
p |
integer, specifying the dimension (aka number of variables). |
frac |
numeric, the proportion of outliers.
The default, |
sig1 |
thickness of the “wheel”, ( |
sig2 |
thickness of the “axis” (compared to 1). |
rGood |
function; the generator for “good” observations. |
rOut |
function, generating the outlier observations. |
U1 |
p-vector to which |
scaleAfter |
logical indicating if the matrix is re-scaled after
rotation (via |
scaleBefore |
logical indicating if the matrix is re-scaled before
rotation (via |
spherize |
logical indicating if the matrix is to be
“spherized”, i.e., rotated and scaled to have empirical
covariance |
fullResult |
logical indicating if in addition to the |
....
By default (when fullResult
is FALSE
), an
matrix of
sample vectors of the
dimensional barrow wheel distribution, with an attribute,
n1
specifying the exact number of “good” observations,
,
frac
.
If fullResult
is TRUE
, a list with components
X |
the |
X0 |
......... |
A |
the |
n1 |
the number of “good” observations, see above. |
n2 |
the number of “outlying” observations, |
Werner Stahel and Martin Maechler
http://stat.ethz.ch/people/maechler/robustness
Stahel, W.~A. and Mächler, M. (2009). Comment on “invariant co-ordinate selection”, Journal of the Royal Statistical Society B 71, 584–586. doi:10.1111/j.1467-9868.2009.00706.x
set.seed(17) rX8 <- rbwheel(1000,8, fullResult = TRUE, scaleAfter=FALSE) with(rX8, stopifnot(all.equal(X, X0 %*% A, tol = 1e-15), all.equal(X0, X %*% t(A), tol = 1e-15))) ##--> here, don't need to keep X0 (nor A, since that is Qrot(p)) ## for n = 100, you don't see "it", but may guess .. : n <- 100 pairs(r <- rbwheel(n,6)) n1 <- attr(r,"n1") ; pairs(r, col=1+((1:n) > n1)) ## for n = 500, you *do* see it : n <- 500 pairs(r <- rbwheel(n,6)) ## show explicitly n1 <- attr(r,"n1") ; pairs(r, col=1+((1:n) > n1)) ## but increasing sig2 does help: pairs(r <- rbwheel(n,6, sig2 = .2)) ## show explicitly n1 <- attr(r,"n1") ; pairs(r, col=1+((1:n) > n1)) set.seed(12) pairs(X <- rbwheel(n, 7, spherize=TRUE)) colSums(X) # already centered if(require("ICS") && require("robustbase")) { # ICS: Compare M-estimate [Max.Lik. of t_{df = 2}] with high-breakdown : stopifnot(require("MASS")) X.paM <- ics(X, S1 = cov, S2 = function(.) cov.trob(., nu=2)$cov, stdKurt = FALSE) X.paM.<- ics(X, S1 = cov, S2 = function(.) tM(., df=2)$V, stdKurt = FALSE) X.paR <- ics(X, S1 = cov, S2 = function(.) covMcd(.)$cov, stdKurt = FALSE) plot(X.paM) # not at all clear plot(X.paM.)# ditto plot(X.paR)# very clear } ## Similar such experiments ---> demo(rbwheel_d) and demo(rbwheel_ics) ## -------------- -----------------
set.seed(17) rX8 <- rbwheel(1000,8, fullResult = TRUE, scaleAfter=FALSE) with(rX8, stopifnot(all.equal(X, X0 %*% A, tol = 1e-15), all.equal(X0, X %*% t(A), tol = 1e-15))) ##--> here, don't need to keep X0 (nor A, since that is Qrot(p)) ## for n = 100, you don't see "it", but may guess .. : n <- 100 pairs(r <- rbwheel(n,6)) n1 <- attr(r,"n1") ; pairs(r, col=1+((1:n) > n1)) ## for n = 500, you *do* see it : n <- 500 pairs(r <- rbwheel(n,6)) ## show explicitly n1 <- attr(r,"n1") ; pairs(r, col=1+((1:n) > n1)) ## but increasing sig2 does help: pairs(r <- rbwheel(n,6, sig2 = .2)) ## show explicitly n1 <- attr(r,"n1") ; pairs(r, col=1+((1:n) > n1)) set.seed(12) pairs(X <- rbwheel(n, 7, spherize=TRUE)) colSums(X) # already centered if(require("ICS") && require("robustbase")) { # ICS: Compare M-estimate [Max.Lik. of t_{df = 2}] with high-breakdown : stopifnot(require("MASS")) X.paM <- ics(X, S1 = cov, S2 = function(.) cov.trob(., nu=2)$cov, stdKurt = FALSE) X.paM.<- ics(X, S1 = cov, S2 = function(.) tM(., df=2)$V, stdKurt = FALSE) X.paR <- ics(X, S1 = cov, S2 = function(.) covMcd(.)$cov, stdKurt = FALSE) plot(X.paM) # not at all clear plot(X.paM.)# ditto plot(X.paR)# very clear } ## Similar such experiments ---> demo(rbwheel_d) and demo(rbwheel_ics) ## -------------- -----------------
Calculate an estimate of location, asymptotically equivalent to the median, and an estimate of scale equal to the MEAN absolute deviation. Both done recursively.
reclas(y, b = 0.2, mfn = function(n) 0.1 * n^(-0.25), nstart = 30, m0 = median(y0), scon=NULL, updateScale = is.null(scon))
reclas(y, b = 0.2, mfn = function(n) 0.1 * n^(-0.25), nstart = 30, m0 = median(y0), scon=NULL, updateScale = is.null(scon))
y |
numeric vector of i.i.d. data whose location and scale parameters are to be estimated. |
b |
numeric tuning parameter (default value equal to that used by Holst, 1987). |
mfn |
a |
nstart |
number of starting values: Starting values for the
algorithm are formed from the first |
m0 |
value for the initial approximate median; by default, the
|
scon |
value for the scale parameter |
updateScale |
a logical indicating if the scale, initialized
from |
An S3 “object” of class
"reclas"
; simply a
list with entries
locn |
the successive recursive estimates of location. The
first |
scale |
the successive recursive estimates of scale if
|
updateScale |
the same as the function argument. |
call |
the function call, i.e., |
There is a plot
method for "reclas"
, see the
examples.
[email protected] http://www.stat.auckland.ac.nz/~rolf
Extensions by Martin Maechler (scon
as function;
updateScale
, plot()
).
Cameron, Murray A. and Turner, T. Rolf (1993). Recursive location and scale estimators. Commun. Statist. — Theory Meth. 22(9) 2503–2515.
Holst, U. (1987). Recursive estimators of location. Commun. Statist. — Theory Meth. 16 (8) 2201–2226.
set.seed(42) y <- rt(10000, df = 1.5) # not quite Gaussian ... z1 <- reclas(y) z3 <- reclas(y, scon= 1 ) # correct fixed scale z4 <- reclas(y, scon= 100) # wrong fixed scale z2 <- reclas(y, # a more robust initial scale: scon = function(y0, m0) robustbase::Qn(y0 - m0), updateScale = TRUE) # still updated ## Visualizing -- using the plot() method for "reclas": M <- median(y) ; yl <- c(-1,1)* 0.5 OP <- par(mfrow=c(2,2), mar=.1+c(3,3,1,1), mgp=c(1.5, .6, 0)) plot(z1, M=M, ylim=yl) plot(z2, M=M, ylim=yl) plot(z3, M=M, ylim=yl) plot(z4, M=M, ylim=yl) par(OP)
set.seed(42) y <- rt(10000, df = 1.5) # not quite Gaussian ... z1 <- reclas(y) z3 <- reclas(y, scon= 1 ) # correct fixed scale z4 <- reclas(y, scon= 100) # wrong fixed scale z2 <- reclas(y, # a more robust initial scale: scon = function(y0, m0) robustbase::Qn(y0 - m0), updateScale = TRUE) # still updated ## Visualizing -- using the plot() method for "reclas": M <- median(y) ; yl <- c(-1,1)* 0.5 OP <- par(mfrow=c(2,2), mar=.1+c(3,3,1,1), mgp=c(1.5, .6, 0)) plot(z1, M=M, ylim=yl) plot(z2, M=M, ylim=yl) plot(z3, M=M, ylim=yl) plot(z4, M=M, ylim=yl) par(OP)