KAN: Kolmogorov–Arnold Networks: A review

Vikas Dhiman

Why review this?

On Apr 30, 2024, [5] appeared on arXiv, and by May 7th I had heard about this paper from multiple students, students who do not usually bring new papers to my attention. It must be special, I thought. I decided to take a look.

If I were professionally reviewing this paper, I would accept it with major revisions. The paper has enough contributions to deserve publication, but some of the claims need to be toned down, interpretations need to be clarified, and comparisons with spline-based neural networks need to be made.

Outline

I make four major critiques of the paper:

  1. MLPs have learnable activation functions as well.

  2. The content of the paper does not justify the name, Kolmogorov-Arnold networks (KANs).

  3. KANs are MLPs with spline-basis as the activation function.

  4. KANs do not beat the curse of dimensionality.

MLPs have learnable activation functions as well

The authors claim in the abstract,

While MLPs have fixed activation functions on nodes (“neurons”), KANs have learnable activation functions on edges (“weights”). KANs have no linear weights at all – every weight parameter is replaced by a univariate function parametrized as a spline.

This is not a helpful description because one can interpret MLPs as having "learnable activation functions" as well; it depends on what you define as the "activation function". Consider a two-layer MLP with input $\mathbf{x} \in \mathbb{R}^n$, weights $W_1, W_2$ (ignoring biases for now), and activation function $\sigma$,

$f(\mathbf{x}) = W_2 \, \sigma(W_1 \mathbf{x}) = W_2 \, \phi_1(\mathbf{x})$. (1)

If I define $\phi_1(\mathbf{x}) = \sigma(W_1 \mathbf{x})$ and call $\phi_1(\cdot)$ the activation function, then I have a learnable activation function in an MLP. The same goes for Figure 0.1: it is a reinterpretation, not a redesign, of MLPs as claimed.
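To make the reinterpretation concrete, here is a minimal numpy sketch (my own illustration, not code from the paper): the same two-layer network is evaluated once in the usual $W_2 \, \sigma(W_1 \mathbf{x})$ reading and once as $W_2 \, \phi_1(\mathbf{x})$, with $\phi_1$ treated as a "learnable activation function" parametrized by $W_1$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, h = 3, 5                      # input and hidden dimensions
W1 = rng.normal(size=(h, n))     # parameters of the "learnable activation"
W2 = rng.normal(size=(1, h))
sigma = np.tanh                  # any fixed nonlinearity

def phi1(x):
    # A "learnable activation function": sigma composed with a linear map.
    return sigma(W1 @ x)

x = rng.normal(size=n)
usual_mlp = W2 @ sigma(W1 @ x)   # standard reading: fixed activation, linear weights
reinterpreted = W2 @ phi1(x)     # same computation, read as a learnable activation
assert np.allclose(usual_mlp, reinterpreted)
```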

What’s in the name

How do KANs actually use the Kolmogorov-Arnold Theorem (KAT)? The theorem is not actually useful in the development of KANs; KANs are only inspired by KAT, not based on it.

So what is the Kolmogorov-Arnold Theorem? The paper describes it as the decomposition of any smooth function $f: [0,1]^n \to \mathbb{R}$ in terms of a finite number of univariate functions $\phi_q^{(2)}: \mathbb{R} \to \mathbb{R}$ and $\phi_{p,q}: [0,1] \to \mathbb{R}$ (a slight change in notation from the paper),

$f(\mathbf{x}) = f(x_1, \dots, x_n) = \sum_{q=1}^{2n+1} \phi_q^{(2)}\left( \sum_{p=1}^{n} \phi_{p,q}(x_p) \right)$. (2)
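To see the structure of (2) concretely, here is a small sketch with arbitrarily chosen univariate functions, purely to show the wiring; KAT guarantees such functions exist for a given $f$ but does not tell us how to find them.

```python
import numpy as np

n = 3
# Arbitrary univariate inner and outer functions, chosen only to show the wiring.
inner = [[(lambda x, a=p + q: np.sin(a * x)) for p in range(n)]
         for q in range(2 * n + 1)]
outer = [(lambda s, b=q: np.tanh(s + b)) for q in range(2 * n + 1)]

def kat_form(x):
    # f(x) = sum_q phi_q^{(2)}( sum_p phi_{p,q}(x_p) )   -- Eq. (2)
    return sum(outer[q](sum(inner[q][p](x[p]) for p in range(n)))
               for q in range(2 * n + 1))

print(kat_form(np.array([0.1, 0.5, 0.9])))
```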

If you plan to use the Kolmogorov-Arnold Theorem (KAT), you have to understand its central claim and how the theorem differs from its nearest competitor, the Universal Approximation Theorem (UAT). The UAT states that any function can be approximated by a wide enough 2-layer neural network,

$f(\mathbf{x}) = \sum_{q=1}^{\infty} w_q^{(2)} \, \sigma\left( \sum_{p=1}^{n} w_{q,p}^{(1)} x_p \right)$, where $W_2 = [w_q^{(2)}]_{q=1}^{\infty}$ and $W_1 = [[w_{q,p}^{(1)}]_{q=1}^{\infty}]_{p=1}^{n}$. (3)
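As a sanity check on the notation, here is a sketch (assuming a finite width $N$ standing in for the possibly infinite sum) showing that the summation form (3) is the same computation as the matrix form $W_2 \, \sigma(W_1 \mathbf{x})$ from (1).

```python
import numpy as np

rng = np.random.default_rng(1)
n, N = 3, 8                      # input dimension, finite width standing in for the sum
W1 = rng.normal(size=(N, n))     # [[w_{q,p}^{(1)}]]
W2 = rng.normal(size=(1, N))     # [w_q^{(2)}]
sigma = np.tanh
x = rng.normal(size=n)

# Eq. (3) written out as nested sums over q and p.
summation_form = sum(W2[0, q] * sigma(sum(W1[q, p] * x[p] for p in range(n)))
                     for q in range(N))
matrix_form = (W2 @ sigma(W1 @ x)).item()
assert np.isclose(summation_form, matrix_form)
```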

I wrote the MLP in terms of a summation instead of a matrix multiplication to draw parallels between UAT and KAT. There are two main differences between UAT and KAT:

  1. UAT deals with linear layers with a common activation function (like sigmoid [3], ReLU, tanh), while KAT deals with arbitrary functions, possibly "non-smooth and even fractal".

  2. UAT needs possibly infinitely many hidden units for exact approximation, while KAT needs only 2n+1 "hidden units".

I would claim that the central point of KAT is about needing only 2n+1 hidden units; otherwise it is a weaker theorem than UAT. Does the KAN paper make use of the 2n+1 hidden units consistently? No. But they justify basing the rest of the paper on KAT by saying,

However, we are more optimistic about the usefulness of the Kolmogorov-Arnold theorem for machine learning. First of all, we need not stick to the original Eq. (2.1) which has only two-layer non-linearities and a small number of terms (2n + 1) in the hidden layer: we will generalize the network to arbitrary widths and depths.

Okay. But aren’t we back to the Universal Approximation Theorem then?

There is one aspect of KAT that the authors highlight: “In a sense, they showed that the only true multivariate function is addition, since every other function can be written using univariate functions and sum.” This is a cool interpretation, but it does not separate KAT from UAT, which already underlies MLPs.

KANs are MLPs with spline-based activation functions

In practice, the authors end up proposing a KAN residual layer in which each scalar function is written as,

$\phi(x) = w \left( \mathrm{silu}(x) + \mathrm{spline}(x) \right)$, (4)
$\mathrm{spline}(x) = \sum_{i=1}^{G} c_i B_i(x)$. (5)

What are splines? (See https://personal.math.vt.edu/embree/math5466/lecture10.pdf for an introduction.) For the purpose of this section, you do not need to know splines. By the way, there is a history of splines in neural networks that is not cited in this paper [2, 1]. For now, assume splines are functions that result from a linear combination $\sum_i c_i B_i(x)$ of a particular kind of basis functions $B_i(x)$ (B-form splines). To reinterpret this scalar function as an MLP, let's rewrite it as,
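For a concrete picture, here is a minimal sketch of B-spline basis functions via the Cox-de Boor recursion (my own illustration; the paper's implementation may construct its grid and basis differently):

```python
import numpy as np

def bspline_basis(x, knots, i, k):
    """Value of the i-th B-spline basis function of degree k at x (Cox-de Boor)."""
    if k == 0:
        return 1.0 if knots[i] <= x < knots[i + 1] else 0.0
    left_den = knots[i + k] - knots[i]
    right_den = knots[i + k + 1] - knots[i + 1]
    left = 0.0 if left_den == 0 else (x - knots[i]) / left_den * bspline_basis(x, knots, i, k - 1)
    right = 0.0 if right_den == 0 else (knots[i + k + 1] - x) / right_den * bspline_basis(x, knots, i + 1, k - 1)
    return left + right

degree = 3
grid = np.linspace(-1.0, 1.0, 8)                      # grid points on the input range
knots = np.concatenate([[grid[0]] * degree, grid, [grid[-1]] * degree])
G = len(knots) - degree - 1                           # number of basis functions

def spline(x, c):
    # spline(x) = sum_i c_i B_i(x)   -- Eq. (5)
    return sum(c[i] * bspline_basis(x, knots, i, degree) for i in range(G))

c = np.random.default_rng(2).normal(size=G)
print(spline(0.3, c))
```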

$\phi(x) = w \left( \sum_{i=1}^{G} c_i B_i(x) + \mathrm{silu}(x) \right)$ (6)
$= \underbrace{\begin{bmatrix} w c_1 & w c_2 & \dots & w c_G & w \end{bmatrix}}_{\mathbf{w}^\top} \underbrace{\begin{bmatrix} B_1(x) & B_2(x) & \dots & B_G(x) & \mathrm{silu}(x) \end{bmatrix}^\top}_{\mathbf{b}(x)}$ (13)
$= \mathbf{w}^\top \mathbf{b}(x)$. (14)
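The rewrite in (6)-(14) is pure bookkeeping and holds for any fixed basis. A quick numerical check (with Gaussian bumps standing in for the B-spline basis, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
G = 5
centers = np.linspace(-1.0, 1.0, G)

def silu(x):
    return x / (1.0 + np.exp(-x))

def basis(x):
    # Placeholder basis functions B_i(x); Gaussian bumps stand in for B-splines here.
    return np.exp(-(x - centers) ** 2)

w = rng.normal()
c = rng.normal(size=G)
x = 0.4

lhs = w * (silu(x) + c @ basis(x))                 # Eqs. (4)-(5)
w_vec = np.concatenate([w * c, [w]])               # [w c_1 ... w c_G  w]
b_vec = np.concatenate([basis(x), [silu(x)]])      # [B_1(x) ... B_G(x)  silu(x)]
rhs = w_vec @ b_vec                                # Eq. (14): w^T b(x)
assert np.isclose(lhs, rhs)
```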

Here $\mathbf{w}$ contains the learnable parameters of the splines, and $\mathbf{b}(x)$ is deterministic once the spline grid is fixed (though it can be made learnable). Let's put this back into (2),

$f(\mathbf{x}) = f(x_1, \dots, x_n) = \sum_{q=1}^{2n+1} \phi_q^{(2)}\left( \sum_{p=1}^{n} \phi_{p,q}^{(1)}(x_p) \right)$ (15)
$= \sum_{q=1}^{2n+1} \mathbf{w}_q^{(2)\top} \mathbf{b}\left( \sum_{p=1}^{n} \mathbf{w}_{p,q}^{(1)\top} \mathbf{b}(x_p) \right)$. (16)

This is very close to an MLP if we consider the $\mathbf{w}$'s as linear weights and the basis functions as activation functions, with the following differences:

  1. The activation function $\mathbf{b}(\cdot)$ is applied on the input side, which is typically not a part of MLPs. However, it is common to convert the input into a set of feature vectors as a pre-processing step rather than providing MLPs the raw input.

  2. Unlike $w_{q,p}^{(1)}$, which is a scalar in (3), $\mathbf{w}_{p,q}^{(1)}$ is a vector in (16). This is not a problem because it is still a linear combination of input values processed through the basis functions $\mathbf{b}(x)$. To make this explicit, we write (16) as a matrix-vector multiplication followed by activation functions.

To write (16) as a matrix-vector product, consider only the first-layer term,

$\left[ \sum_{p=1}^{n} \mathbf{w}_{p,q}^{(1)\top} \mathbf{b}(x_p) \right]_{q=1}^{2n+1} = \underbrace{\begin{bmatrix} \mathbf{w}_{1,1}^{(1)\top} & \dots & \mathbf{w}_{1,n}^{(1)\top} \\ \mathbf{w}_{2,1}^{(1)\top} & \dots & \mathbf{w}_{2,n}^{(1)\top} \\ \vdots & & \vdots \\ \mathbf{w}_{2n+1,1}^{(1)\top} & \dots & \mathbf{w}_{2n+1,n}^{(1)\top} \end{bmatrix}}_{\mathbf{W}^{(1)} \in \mathbb{R}^{(2n+1) \times nG}} \underbrace{\begin{bmatrix} \mathbf{b}(x_1) \\ \vdots \\ \mathbf{b}(x_n) \end{bmatrix}}_{\mathbf{B}(\mathbf{x}) \in \mathbb{R}^{nG \times 1}}$. (24)

You can apply this interpretation repeatedly,

$f(\mathbf{x}) = \mathbf{W}^{(2)} \mathbf{B}(\mathbf{W}^{(1)} \mathbf{B}(\mathbf{x}))$, (25)

where $\mathbf{x} \in \mathbb{R}^{n}$, $\mathbf{B}: \mathbb{R}^{n} \to \mathbb{R}^{nG}$, $\mathbf{W}^{(1)} \in \mathbb{R}^{(2n+1) \times nG}$, and $\mathbf{W}^{(2)} \in \mathbb{R}^{1 \times (2n+1)G}$.

Here $\mathbf{B}(\mathbf{x})$ is unlike other activation functions. Instead of producing a scalar from a scalar, it produces $G$ different values for each scalar value in the input.
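Putting this together, (25) can be read as a two-layer MLP whose "activation" $\mathbf{B}$ expands every scalar into its basis values. A minimal sketch of that reading (my own, again with Gaussian bumps standing in for the spline basis, and with $G+1$ features per scalar to account for the silu term):

```python
import numpy as np

rng = np.random.default_rng(4)
n, G = 2, 4
centers = np.linspace(-1.0, 1.0, G)

def silu(x):
    return x / (1.0 + np.exp(-x))

def B(v):
    # "Activation" B: expands each scalar v_j into G basis values plus silu(v_j),
    # so a length-m vector becomes a length-m*(G+1) vector.
    feats = [np.append(np.exp(-(vj - centers) ** 2), silu(vj)) for vj in v]
    return np.concatenate(feats)

hidden = 2 * n + 1                                   # the 2n+1 units of Eq. (2)
W1 = rng.normal(size=(hidden, n * (G + 1)))          # vector weights per (q, p) pair
W2 = rng.normal(size=(1, hidden * (G + 1)))

x = rng.normal(size=n)
f = W2 @ B(W1 @ B(x))                                # Eq. (25)
print(f)
```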

The claim that KANs beat the curse of dimensionality is wrong

The authors claim that,

KANs with finite grid size can approximate the function well with a residue rate independent of the dimension, hence beating curse of dimensionality!

This is a huge claim and requires huge evidence. As outlined in the previous section, if all KANs can be written as MLPs, then either both MLPs and KANs beat the curse of dimensionality or neither does.

My first objection is to how the "curse of dimensionality" is interpreted. Typically, the curse of dimensionality in machine learning is measured by the amount of data needed to train a function to a desired error.

I do not understand the proof of Theorem 2.1, especially the first step. It is not clear which theorem in [4] the first result follows from. If a page number or chapter were provided, that would be great.

It is also counterintuitive because a single grid size G is assumed for all n input dimensions. What would the bound look like if each dimension of x were divided into a different grid size?

References

  • [1] S. Aziznejad, H. Gupta, J. Campos, and M. Unser (2020) Deep neural networks with trainable activations and controlled Lipschitz constant. IEEE Transactions on Signal Processing 68, pp. 4688–4699.
  • [2] P. Bohra, J. Campos, H. Gupta, S. Aziznejad, and M. Unser (2020) Learning activation functions in deep (spline) neural networks. IEEE Open Journal of Signal Processing 1, pp. 295–309.
  • [3] G. Cybenko (1989) Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems 2 (4), pp. 303–314.
  • [4] C. de Boor (2001) A practical guide to splines. Applied Mathematical Sciences, Springer New York. ISBN 9780387953663.
  • [5] Z. Liu, Y. Wang, S. Vaidya, F. Ruehle, J. Halverson, M. Soljačić, T. Y. Hou, and M. Tegmark (2024) KAN: Kolmogorov-Arnold networks. arXiv preprint arXiv:2404.19756.