Title: | Hartigan's Dip Test Statistic for Unimodality - Corrected |
---|---|
Description: | Compute Hartigan's dip test statistic for unimodality / multimodality and provide a test with simulation based p-values, where the original public code has been corrected. |
Authors: | Martin Maechler [aut, cre] (Ported to R and C from Fortran and S-plus; much enhanced, <https://orcid.org/0000-0002-8685-9910>), Dario Ringach [aut] (Fortran and S-plus; at NYU.edu) |
Maintainer: | Martin Maechler <[email protected]> |
License: | GPL (>= 2) |
Version: | 0.77-2 |
Built: | 2024-10-31 20:27:32 UTC |
Source: | https://github.com/mmaechler/diptest |
Computes Hartigans' dip test statistic for testing unimodality, and additionally the modal interval.
dip(x, full.result = FALSE, min.is.0 = FALSE, debug = FALSE)
dip(x, full.result = FALSE, min.is.0 = FALSE, debug = FALSE)
x |
numeric; the data. |
full.result |
logical or string; |
min.is.0 |
logical indicating if the minimal value of the
dip statistic |
debug |
logical; if true, some tracing information is printed (from the C routine). |
depending on full.result
either a number, the dip statistic, or
an object of class "dip"
which is a list
with components
x |
the sorted |
n |
|
dip |
the dip statistic |
lo.hi |
indices into |
xl , xu
|
lower and upper end of modal interval |
gcm , lcm
|
(last used) indices for greatest convex minorant and the least concave majorant. |
mn , mj
|
index vectors of length |
For “full” results of class "dip"
, there are
print
and plot
methods, the latter with
its own manual page.
For where
n <- length(x)
, the dip
statistic is always the same minimum value,
, i.e., there's no possible dip test.
Note that up to May 2011, from Hartigan's original Fortran code,
Dn
was set to zero, when all x
values were identical. However,
this entailed discontinuous behavior, where for arbitrarily close
data ,
.
Yong Lu [email protected] found in Oct 2003 that the code was not giving symmetric results for mirrored data (and was giving results of almost 1, and then found the reason, a misplaced ‘")"’ in the original Fortran code. This bug has been corrected for diptest version 0.25-0 (Feb 13, 2004).
Nick Cox (Durham Univ.) said (on March 20, 2008 on the Stata-list):
As it comes from a bimodal husband-wife collaboration, the name
perhaps should be “Hartigan-Hartigan dip test”, but that
does not seem to have caught on. Some of my less statistical
colleagues would sniff out the hegemony of patriarchy there, although
which Hartigan is being overlooked is not clear.
Martin Maechler, as a Swiss, and politician, would say:
Let's find a compromise, and call it “Hartigans' dip test”,
so we only have to adapt orthography (:-).
Martin Maechler [email protected], 1994,
based on S (S-PLUS) and C code donated from Dario Ringach
[email protected] who had applied f2c
on the
original Fortran code available from Statlib.
In Aug.1993, recreated and improved Hartigans' "P-value" table, which
later became qDiptab
.
P. M. Hartigan (1985)
Computation of the Dip Statistic to Test for Unimodality;
Applied Statistics (JRSS C) 34, 320–325.
Corresponding (buggy!) Fortran code of ‘AS 217’ available from Statlib,
https://lib.stat.cmu.edu/apstat/217
J. A. Hartigan and P. M. Hartigan (1985) The Dip Test of Unimodality; Annals of Statistics 13, 70–84.
dip.test
to compute the dip and perform the unimodality test,
based on P-values, interpolated from qDiptab
;
isoreg
for isotonic regression.
data(statfaculty) plot(density(statfaculty)) rug(statfaculty, col="midnight blue"); abline(h=0, col="gray") dip(statfaculty) (dS <- dip(statfaculty, full = TRUE, debug = TRUE)) plot(dS) ## even more output -- + plot showing "global" GCM/LCM: (dS2 <- dip(statfaculty, full = "all", debug = 3)) plot(dS2) data(faithful) fE <- faithful$eruptions plot(density(fE)) rug(fE, col="midnight blue"); abline(h=0, col="gray") dip(fE, debug = 2) ## showing internal work (dE <- dip(fE, full = TRUE)) ## note the print method plot(dE, do.points=FALSE) data(precip) plot(density(precip)) rug(precip, col="midnight blue"); abline(h=0, col="gray") str(dip(precip, full = TRUE, debug = TRUE)) ##----------------- The 'min.is.0' option : --------------------- ##' dip(.) continuity and 'min.is.0' exploration: dd <- function(x, debug=FALSE) { x_ <- x ; x_[1] <- 0.9999999999 * x[1] rbind(dip(x , debug=debug), dip(x_, debug=debug), dip(x , min.is.0=TRUE, debug=debug), dip(x_, min.is.0=TRUE, debug=debug), deparse.level=2) } dd( rep(1, 8) ) # the 3rd one differs ==> min.is.0=TRUE is *dis*continuous dd( 1:7 ) # ditto dd( 1:7, debug=TRUE) ## border-line case .. dd( 1:2, debug=TRUE) ## Demonstrate that 'min.is.0 = TRUE' does not change the typical result: B.sim <- 1000 # or larger D5 <- {set.seed(1); replicate(B.sim, dip(runif(5)))} D5. <- {set.seed(1); replicate(B.sim, dip(runif(5), min.is.0=TRUE))} stopifnot(identical(D5, D5.), all.equal(min(D5), 1/(2*5))) hist(D5, 64); rug(D5) D8 <- {set.seed(7); replicate(B.sim, dip(runif(8)))} D8. <- {set.seed(7); replicate(B.sim, dip(runif(8), min.is.0=TRUE))} stopifnot(identical(D8, D8.))
data(statfaculty) plot(density(statfaculty)) rug(statfaculty, col="midnight blue"); abline(h=0, col="gray") dip(statfaculty) (dS <- dip(statfaculty, full = TRUE, debug = TRUE)) plot(dS) ## even more output -- + plot showing "global" GCM/LCM: (dS2 <- dip(statfaculty, full = "all", debug = 3)) plot(dS2) data(faithful) fE <- faithful$eruptions plot(density(fE)) rug(fE, col="midnight blue"); abline(h=0, col="gray") dip(fE, debug = 2) ## showing internal work (dE <- dip(fE, full = TRUE)) ## note the print method plot(dE, do.points=FALSE) data(precip) plot(density(precip)) rug(precip, col="midnight blue"); abline(h=0, col="gray") str(dip(precip, full = TRUE, debug = TRUE)) ##----------------- The 'min.is.0' option : --------------------- ##' dip(.) continuity and 'min.is.0' exploration: dd <- function(x, debug=FALSE) { x_ <- x ; x_[1] <- 0.9999999999 * x[1] rbind(dip(x , debug=debug), dip(x_, debug=debug), dip(x , min.is.0=TRUE, debug=debug), dip(x_, min.is.0=TRUE, debug=debug), deparse.level=2) } dd( rep(1, 8) ) # the 3rd one differs ==> min.is.0=TRUE is *dis*continuous dd( 1:7 ) # ditto dd( 1:7, debug=TRUE) ## border-line case .. dd( 1:2, debug=TRUE) ## Demonstrate that 'min.is.0 = TRUE' does not change the typical result: B.sim <- 1000 # or larger D5 <- {set.seed(1); replicate(B.sim, dip(runif(5)))} D5. <- {set.seed(1); replicate(B.sim, dip(runif(5), min.is.0=TRUE))} stopifnot(identical(D5, D5.), all.equal(min(D5), 1/(2*5))) hist(D5, 64); rug(D5) D8 <- {set.seed(7); replicate(B.sim, dip(runif(8)))} D8. <- {set.seed(7); replicate(B.sim, dip(runif(8), min.is.0=TRUE))} stopifnot(identical(D8, D8.))
Compute Hartigans' dip statistic , and
its p-value for the test for unimodality, by interpolating
tabulated quantiles of
.
For ,
the null hypothesis is that
is a unimodal distribution.
Consequently, the test alternative is non-unimodal, i.e., at least
bimodal. Using the language of medical testing, you would call the
test “Test for Multimodality”.
dip.test(x, simulate.p.value = FALSE, B = 2000)
dip.test(x, simulate.p.value = FALSE, B = 2000)
x |
numeric vector; sample to be tested for unimodality. |
simulate.p.value |
a logical indicating whether to compute p-values by Monte Carlo simulation. |
B |
an integer specifying the number of replicates used in the Monte Carlo test. |
If simulate.p.value
is FALSE
, the p-value is computed
via linear interpolation (of ) in the
qDiptab
table.
Otherwise the p-value is computed from a Monte Carlo simulation of a
uniform distribution (runif(n)
) with B
replicates.
A list with class "htest"
containing the following
components:
statistic |
the dip statistic |
p.value |
the p-value for the test, see details. |
method |
character string describing the test, and whether Monte Carlo simulation was used. |
data.name |
a character string giving the name(s) of the data. |
see also the package vignette, which describes the procedure in more details.
Martin Maechler
see those in dip
.
For goodness-of-fit testing, notably of continuous distributions,
ks.test
.
## a first non-trivial case (d.t <- dip.test(c(0,0, 1,1))) # "perfect bi-modal for n=4" --> p-value = 0 stopifnot(d.t$p.value == 0) data(statfaculty) plot(density(statfaculty)); rug(statfaculty) (d.t <- dip.test(statfaculty)) x <- c(rnorm(50), rnorm(50) + 3) plot(density(x)); rug(x) ## border-line bi-modal ... BUT (most of the times) not significantly: dip.test(x) dip.test(x, simulate=TRUE, B=5000) ## really large n -- get a message dip.test(runif(4e5))
## a first non-trivial case (d.t <- dip.test(c(0,0, 1,1))) # "perfect bi-modal for n=4" --> p-value = 0 stopifnot(d.t$p.value == 0) data(statfaculty) plot(density(statfaculty)); rug(statfaculty) (d.t <- dip.test(statfaculty)) x <- c(rnorm(50), rnorm(50) + 3) plot(density(x)); rug(x) ## border-line bi-modal ... BUT (most of the times) not significantly: dip.test(x) dip.test(x, simulate=TRUE, B=5000) ## really large n -- get a message dip.test(runif(4e5))
63 (integer) numbers; unimodal or bimodal, that's the question.
This is now deprecated.
Please use statfaculty
instead!
data(exHartigan) plot(dH <- density(exHartigan)) rug(exHartigan)# should jitter
data(exHartigan) plot(dH <- density(exHartigan)) rug(exHartigan)# should jitter
Plot method for "dip"
objects, i.e., the result of
dip(., full.result=TRUE)
or similar.
Note: We may decide to enhance the plot in the future, possibly not entirely back-compatibly.
## S3 method for class 'dip' plot(x, do.points = (n < 20), colG = "red3", colL = "blue3", colM = "forest green", col.points = par("col"), col.hor = col.points, doModal = TRUE, doLegend = TRUE, ...)
## S3 method for class 'dip' plot(x, do.points = (n < 20), colG = "red3", colL = "blue3", colM = "forest green", col.points = par("col"), col.hor = col.points, doModal = TRUE, doLegend = TRUE, ...)
x |
an R object of |
do.points |
logical indicating if the ECDF plot should include
points; passed to |
colG , colL , colM
|
the colors to be used in the graphics for the Greatest convex minorant, the Least concave majorant, and the Modal interval, respectively. |
col.points , col.hor
|
the color of points or horizontal lines,
respectively, simply passed to |
doModal |
logical indicating if the modal interval |
doLegend |
logical indicating if a legend should be shown. |
... |
further optional arguments, passed to |
Martin Maechler
dip
, also for examples; plot.ecdf
.
Whereas Hartigan(1985) published a table of empirical percentage
points of the dip statistic (see dip
) based on N=9999
samples of size from
, our table of empirical
quantiles is currently based on N=1'000'001 samples for each
.
A numeric matrix
where each row corresponds to sample size , and each column to
a probability (percentage) in
. The dimnames are named
n
and Pr
and coercable to these values, see the
examples. attr(qDiptab, "N_1")
is , such that with
k <- as.numeric(dimnames(qDiptab)$Pr) * attr(qDiptab, "N_1")
,
e.g., qDiptab[n == 15,]
contains exactly the order statistics
(from the
simulated values of
dip(U)
, where U <- runif(15)
.
Taking N=1'000'001 ensures that all the quantile(X, p)
used here are exactly order statistics sort(X)[k]
.
Martin Maechler [email protected], in its earliest form in August 1994.
dip
, also for the references;
dip.test()
which performs the hypothesis test, using
qDtiptab
(and its null hypothesis of a uniform distribution).
data(qDiptab) str(qDiptab) ## the sample sizes `n' : dnqd <- dimnames(qDiptab) (nn <- as.integer(dnqd $n)) ## the probabilities: P.p <- as.numeric(print(dnqd $ Pr)) ## This is as "Table 1" in Hartigan & Hartigan (1985) -- but more accurate ps <- c(1,5,10,50,90,95,99, 99.5, 99.9)/100 tab1 <- qDiptab[nn <= 200, as.character(ps)] round(tab1, 4)
data(qDiptab) str(qDiptab) ## the sample sizes `n' : dnqd <- dimnames(qDiptab) (nn <- as.integer(dnqd $n)) ## the probabilities: P.p <- as.numeric(print(dnqd $ Pr)) ## This is as "Table 1" in Hartigan & Hartigan (1985) -- but more accurate ps <- c(1,5,10,50,90,95,99, 99.5, 99.9)/100 tab1 <- qDiptab[nn <= 200, as.character(ps)] round(tab1, 4)
Faculty quality in statistics departments was assessed as part of a larger study reported by Scully(1982).
Accidentally, this is also provided as the exHartigan
(“example of Hartigans'”) data set.
data(statfaculty)
data(statfaculty)
A numeric vector of 63 (integer) numbers, sorted increasingly, as reported by the reference.
M. G. Scully (1982) Evaluation of 596 programs in mathematics and physical sciences; Chronicle Higher Educ. 25 5, 8–10.
J. A. Hartigan and P. M. Hartigan (1985) The Dip Test of Unimodality; Annals of Statistics 13, 70–84.
data(statfaculty) plot(dH <- density(statfaculty)) rug(jitter(statfaculty)) data(exHartigan) stopifnot(identical(exHartigan,statfaculty))
data(statfaculty) plot(dH <- density(statfaculty)) rug(jitter(statfaculty)) data(exHartigan) stopifnot(identical(exHartigan,statfaculty))