This function performs the kernel-based quadratic distance goodness-of-fit tests. It includes tests for multivariate normality, two-sample tests and \(k\)-sample tests.
Usage
kb.test(
  x,
  y = NULL,
  h = NULL,
  method = "subsampling",
  B = 150,
  b = NULL,
  Quantile = 0.95,
  mu_hat = NULL,
  Sigma_hat = NULL,
  centeringType = "Nonparam",
  K_threshold = 10,
  alternative = "skewness"
)

# S4 method for ANY
kb.test(
  x,
  y = NULL,
  h = NULL,
  method = "subsampling",
  B = 150,
  b = 0.9,
  Quantile = 0.95,
  mu_hat = NULL,
  Sigma_hat = NULL,
  centeringType = "Nonparam",
  K_threshold = 10,
  alternative = "skewness"
)

# S4 method for kb.test
show(object)
Arguments
- x
Numeric matrix or vector of data values.
- y
Numeric matrix or vector of data values. Depending on the input y, the corresponding test is performed: if y = NULL, the function performs the test for normality on x; if y is a data matrix with the same dimensions as x, the function performs the two-sample test between x and y; if y is a numeric or factor vector indicating the group membership of each observation, the function performs the k-sample test.
- h
Bandwidth for the kernel function. If a value is not provided, the algorithm for the selection of an optimal h is performed automatically. See the function select_h for more details.
- method
The method used for critical value estimation: "subsampling", "bootstrap", or "permutation" (default: "subsampling").
- B
The number of iterations to use for critical value estimation (default: 150).
- b
The size of the subsamples used in the subsampling algorithm, expressed as a fraction of the sample size (default: 0.9).
- Quantile
The quantile to use for critical value estimation (default: 0.95).
- mu_hat
Mean vector for the reference distribution.
- Sigma_hat
Covariance matrix of the reference distribution.
- centeringType
String indicating the method used for centering the normal kernel ('Param' or 'Nonparam').
- K_threshold
Maximum number of groups allowed (default: 10). It is a control parameter; increase it when more than 10 samples are provided.
- alternative
Family of alternatives chosen for selecting h, among "location", "scale" and "skewness" (used only if h is not provided).
- object
Object of class kb.test.
Value
An S4 object of class kb.test containing the results of the kernel-based quadratic distance tests, based on the normal kernel. The object contains the following slots:
- method: Description of the kernel-based quadratic distance test performed.
- x: Data list of samples X (and Y).
- Un: The value of the U-statistic.
- H0_Un: A logical value indicating whether or not the null hypothesis is rejected according to Un.
- CV_Un: The critical value computed for the test based on Un.
- Vn: The value of the V-statistic (if available).
- H0_Vn: A logical value indicating whether or not the null hypothesis is rejected according to Vn (if available).
- CV_Vn: The critical value computed for the test based on Vn (if available).
- h: List with the value of the bandwidth parameter used for the normal kernel function. If select_h is used, the matrix of computed power values and the corresponding power plot are also provided.
- B: Number of bootstrap/permutation/subsampling replications.
- var_Un: Exact variance of the kernel-based U-statistic.
- cv_method: The method used to estimate the critical value (one of "subsampling", "permutation" or "bootstrap").
Details
The function kb.test performs the kernel-based quadratic distance tests using the Gaussian kernel with bandwidth parameter h. Depending on the shape of the input y, the function performs the test of multivariate normality, the non-parametric two-sample test or the k-sample test.
The quadratic distance between two probability distributions \(F\) and
\(G\) is
defined as $$d_{K}(F,G)=\iint K(x,y)d(F-G)(x)d(F-G)(y),$$
where \(G\) is a distribution whose goodness of fit we wish to assess and
\(K\) denotes the Normal kernel defined as
$$ K_{{h}}(\mathbf{s}, \mathbf{t}) = (2 \pi)^{-d/2}
\left(\det{\mathbf{\Sigma}_h}\right)^{-\frac{1}{2}}
\exp\left\{-\frac{1}{2}(\mathbf{s} - \mathbf{t})^\top
\mathbf{\Sigma}_h^{-1}(\mathbf{s} - \mathbf{t})\right\},$$
for every \(\mathbf{s}, \mathbf{t} \in \mathbb{R}^d \times
\mathbb{R}^d\), with covariance matrix \(\mathbf{\Sigma}_h=h^2 I\) and
tuning parameter \(h\).
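Since \(\mathbf{\Sigma}_h = h^2 I\), the kernel can be evaluated directly. A minimal sketch in base R (the name normal_kernel is illustrative, not an exported function of the package):
# Normal kernel K_h with Sigma_h = h^2 * I, for numeric vectors s, t of length d
normal_kernel <- function(s, t, h) {
  d <- length(s)
  (2 * pi * h^2)^(-d / 2) * exp(-sum((s - t)^2) / (2 * h^2))
}
normal_kernel(c(0, 0), c(1, 1), h = 0.5)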
Test for Normality:
Let \(x_1, x_2, \ldots, x_n\) be a random sample with empirical distribution function \(\hat F\). We test the null hypothesis of normality, i.e. \(H_0: F = G = \mathcal{N}_d(\mu, \Sigma)\).
We consider the U-statistic estimate of the sample KBQD
$$U_{n}=\frac{1}{n(n-1)}\sum_{i=2}^{n}\sum_{j=1}^{i-1} K_{cen}(\mathbf{x}_{i}, \mathbf{x}_{j}),$$
then the first test statistic is
$$T_{n}=\frac{U_{n}}{\sqrt{Var(U_{n})}},$$
with \(Var(U_n)\) computed exactly following Lindsay et al. (2014), and the V-statistic estimate
$$V_{n} = \frac{1}{n}\sum_{i=1}^{n} \sum_{j=1}^{n}K_{cen}(\mathbf{x}_{i}, \mathbf{x}_{j}),$$
where \(K_{cen}\) denotes the Normal kernel \(K_h\) with parametric centering with respect to the considered normal distribution \(G = \mathcal{N}_d(\mu, \Sigma)\).
The asymptotic distribution of the V-statistic is an infinite combination of weighted independent chi-squared random variables with one degree of freedom. The cutoff value is obtained using the Satterthwaite approximation \(c \cdot \chi_{DOF}^2\), where \(c\) and \(DOF\) are computed exactly following the formulas in Lindsay et al. (2014).
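For instance, once \(c\) and \(DOF\) are available, the cutoff is obtained in base R as follows (the values of c_const and DOF here are hypothetical placeholders, not package output):
c_const <- 1.2   # hypothetical scaling constant c
DOF <- 3.5       # hypothetical degrees of freedom (may be non-integer)
cutoff_Vn <- c_const * qchisq(0.95, df = DOF)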
For the \(U\)-statistic the cutoff is determined empirically, as sketched below:
1. Generate data from the considered normal distribution;
2. Compute the test statistic for B Monte Carlo (MC) replications;
3. Compute the 95th quantile of the empirical distribution of the test statistic.
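A minimal sketch of this Monte Carlo scheme, using a placeholder statistic in place of the kernel-based \(T_n\) (stat_fun, n and the normal parameters are illustrative):
set.seed(123)
n <- 50
B <- 150
stat_fun <- function(x) mean(x)^2   # placeholder for the standardized U-statistic
# steps 1-2: generate B samples from the reference normal and compute the statistic
mc_stats <- replicate(B, stat_fun(rnorm(n, mean = 0, sd = 1)))
# step 3: the critical value is the 95th quantile
cv <- quantile(mc_stats, 0.95)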
k-sample test:
Consider \(k\) random samples of i.i.d. observations \(\mathbf{x}^{(i)}_1, \mathbf{x}^{(i)}_{2},\ldots, \mathbf{x}^{(i)}_{n_i} \sim F_i\), \(i = 1, \ldots, k\). We test if the samples are generated from the same unknown distribution, that is \(H_0: F_1 = F_2 = \ldots = F_k\) versus \(H_1: F_i \not = F_j\), for some \(1 \le i \not = j \le k\).
We construct a distance matrix \(\hat{\mathbf{D}}\), with off-diagonal elements
$$\hat{D}_{ij} = \frac{1}{n_i n_j} \sum_{\ell=1}^{n_i} \sum_{r=1}^{n_j}K_{\bar{F}}(\mathbf{x}^{(i)}_\ell,\mathbf{x}^{(j)}_r), \qquad \mbox{ for } i \not= j,$$
and diagonal elements
$$\hat{D}_{ii} = \frac{1}{n_i (n_i -1)} \sum_{\ell=1}^{n_i} \sum_{r\not= \ell}^{n_i} K_{\bar{F}}(\mathbf{x}^{(i)}_\ell, \mathbf{x}^{(i)}_r), \qquad \mbox{ for } i = j,$$
where \(K_{\bar{F}}\) denotes the Normal kernel \(K_h\) centered non-parametrically with respect to
$$\bar{F} = \frac{n_1 \hat{F}_1 + \ldots + n_k \hat{F}_k}{n}, \quad \mbox{ with } n=\sum_{i=1}^k n_i.$$
We compute the trace statistic
$$\mathrm{trace}(\hat{\mathbf{D}}_n) = \sum_{i=1}^{k}\hat{D}_{ii}$$
and \(D_n\), derived by considering all the possible pairwise comparisons in the k-sample null hypothesis, given as
$$D_n = (k-1) \mathrm{trace}(\hat{\mathbf{D}}_n) - 2 \sum_{i=1}^{k}\sum_{j> i}^{k}\hat{D}_{ij}.$$
We compute the empirical critical value by employing numerical techniques such as the bootstrap, permutation and subsampling algorithms (a sketch of the subsampling variant follows this list):
1. Generate k-tuples, of total size \(n_B\), from the pooled sample following one of the sampling methods;
2. Compute the k-sample test statistic;
3. Repeat B times;
4. Select the \(95^{th}\) quantile of the obtained values.
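A sketch of the subsampling variant, assuming a user-supplied function kstat() that evaluates the k-sample statistic on data and group labels (kstat, dat and group are placeholders; the actual algorithm may additionally preserve group proportions):
subsampling_cv <- function(dat, group, kstat, B = 150, b = 0.9) {
  n <- nrow(dat)
  nB <- round(b * n)                     # total size of each subsample
  stats <- replicate(B, {
    idx <- sample(n, nB)                 # subsample from the pooled data
    kstat(dat[idx, , drop = FALSE], group[idx])
  })
  quantile(stats, 0.95)                  # 95th quantile of the B statistics
}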
Two-sample test:
Let \(x_1, x_2, ..., x_{n_1} \sim F\) and \(y_1, y_2, ..., y_{n_2} \sim G\) be random samples from the distributions \(F\) and \(G\), respectively. We test the null hypothesis that the two samples are generated from the same unknown distribution, that is \(H_0: F=G\) vs \(H_1:F\not=G\). The test statistics coincide with the \(k\)-sample test statistics when \(k=2\).
Kernel centering
The arguments mu_hat and Sigma_hat indicate the normal model considered for the normality test, that is \(H_0: F = N(\)mu_hat, Sigma_hat\()\).
For the two-sample and \(k\)-sample tests, mu_hat and Sigma_hat can be used for the parametric centering of the kernel, in case we want to specify the reference distribution, with centeringType = "Param". Parametric centering is the default method when the test for normality is performed.
The normal kernel centered with respect to
\(G \sim N_d(\mathbf{\mu}, \mathbf{V})\) can be computed as
$$K_{cen(G)}(\mathbf{s}, \mathbf{t}) =
K_{\mathbf{\Sigma_h}}(\mathbf{s}, \mathbf{t}) -
K_{\mathbf{\Sigma_h} + \mathbf{V}}(\mathbf{\mu}, \mathbf{t})
- K_{\mathbf{\Sigma_h} + \mathbf{V}}(\mathbf{s}, \mathbf{\mu}) +
K_{\mathbf{\Sigma_h} + 2\mathbf{V}}(\mathbf{\mu}, \mathbf{\mu}).$$
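A sketch of this computation using the multivariate normal density from the mvtnorm package (kern and K_cen_param are illustrative names; \(\mathbf{\Sigma}_h = h^2 I\) as in the kernel definition above):
library(mvtnorm)
# K_Sigma(s, t) equals the N(0, Sigma) density evaluated at s - t
kern <- function(s, t, Sigma) dmvnorm(s - t, sigma = Sigma)
K_cen_param <- function(s, t, h, mu, V) {
  d <- length(s)
  Sh <- h^2 * diag(d)
  kern(s, t, Sh) - kern(mu, t, Sh + V) -
    kern(s, mu, Sh + V) + kern(mu, mu, Sh + 2 * V)
}
K_cen_param(c(0, 0), c(1, 1), h = 0.5, mu = c(0, 0), V = diag(2))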
We consider the non-parametric centering of the kernel with respect to \(\bar{F}=(n_1 \hat{F}_1 + \ldots + n_k \hat{F}_k)/n\), where \(n=\sum_{i=1}^k n_i\), with centeringType = "Nonparam", for the two- and \(k\)-sample tests.
Let \(\mathbf{z}_1,\ldots, \mathbf{z}_n\) denote the pooled sample. For any
\(s,t \in \{\mathbf{z}_1,\ldots, \mathbf{z}_n\}\), it is given by
$$K_{cen(\bar{F})}(\mathbf{s},\mathbf{t}) = K(\mathbf{s},\mathbf{t}) -
\frac{1}{n}\sum_{i=1}^{n} K(\mathbf{s},\mathbf{z}_i) -
\frac{1}{n}\sum_{i=1}^{n} K(\mathbf{z}_i,\mathbf{t}) +
\frac{1}{n(n-1)}\sum_{i=1}^{n} \sum_{j \not=i}^{n}
K(\mathbf{z}_i,\mathbf{z}_j).$$
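In matrix form, this centering can be applied to the full \(n \times n\) kernel matrix of the pooled sample. A minimal sketch (center_nonparam is an illustrative name):
# K is the n x n kernel matrix with K[i, j] = K(z_i, z_j)
center_nonparam <- function(K) {
  n <- nrow(K)
  # off-diagonal mean, matching the (n(n-1))^{-1} double sum in the formula
  offdiag_mean <- (sum(K) - sum(diag(K))) / (n * (n - 1))
  K - matrix(rowMeans(K), n, n) -
    matrix(colMeans(K), n, n, byrow = TRUE) + offdiag_mean
}
K <- exp(-as.matrix(dist(matrix(rnorm(20), 10, 2)))^2)  # toy kernel matrix
Kc <- center_nonparam(K)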
Note
For the two- and \(k\)-sample tests, the slots Vn, H0_Vn and CV_Vn are empty, while the computed statistics are both reported in the slots Un, H0_Un and CV_Un.
A U-statistic is a type of statistic that is used to estimate a population parameter. It is based on the idea of averaging over all possible distinct combinations of a fixed size from a sample. A V-statistic considers all possible tuples of a certain size, not just distinct combinations, and can be used in contexts where unbiasedness is not required.
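For a symmetric \(n \times n\) kernel matrix K evaluated on a single sample, the two estimates defined in Details differ only in whether the diagonal (i = j) terms enter the sum; a minimal sketch with a toy kernel matrix:
n <- 5
K <- tcrossprod(matrix(rnorm(n * 2), n, 2))   # toy symmetric kernel matrix
# U-statistic as in Details: distinct pairs i > j only
Un <- sum(K[lower.tri(K)]) / (n * (n - 1))
# V-statistic as in Details: all pairs including the diagonal, normalized by n
Vn <- sum(K) / n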
References
Markatou, M. and Saraceno, G. (2024). “A Unified Framework for
Multivariate Two- and k-Sample Kernel-based Quadratic Distance
Goodness-of-Fit Tests.”
https://doi.org/10.48550/arXiv.2407.16374
Lindsay, B.G., Markatou, M. and Ray, S. (2014) "Kernels, Degrees of Freedom, and Power Properties of Quadratic Distance Goodness-of-Fit Tests", Journal of the American Statistical Association, 109:505, 395-410, DOI: 10.1080/01621459.2013.836972
See also
kb.test for the class definition.
Examples
# create a kb.test object
x <- matrix(rnorm(100),ncol=2)
y <- matrix(rnorm(100),ncol=2)
# Normality test
my_test <- kb.test(x, h=0.5)
my_test
#>
#> Kernel-based quadratic distance Normality test
#> U-statistic V-statistic
#> ------------------------------------------------
#> Test Statistic: 0.3027069 0.6189598
#> Critical Value: 1.262023 6.071062
#> H0 is rejected: FALSE FALSE
#> Selected tuning parameter h: 0.5
#>
# Two-sample test
my_test <- kb.test(x,y,h=0.5, method="subsampling",b=0.9,
centeringType = "Nonparam")
my_test
#>
#> Kernel-based quadratic distance two-sample test
#> U-statistic Dn Trace
#> ------------------------------------------------
#> Test Statistic: 0.239788 0.2819997
#> Critical Value: 0.9875162 1.162661
#> H0 is rejected: FALSE FALSE
#> CV method: subsampling
#> Selected tuning parameter h: 0.5
#>
# k-sample test
z <- matrix(rnorm(100,2),ncol=2)
dat <- rbind(x,y,z)
group <- rep(c(1,2,3),each=50)
my_test <- kb.test(x=dat,y=group,h=0.5, method="subsampling",b=0.9)
my_test
#>
#> Kernel-based quadratic distance k-sample test
#> U-statistic Dn Trace
#> ------------------------------------------------
#> Test Statistic: 7.325505 11.45482
#> Critical Value: 0.7402039 1.158313
#> H0 is rejected: TRUE TRUE
#> CV method: subsampling
#> Selected tuning parameter h: 0.5
#>