**Kernel Methods**

Kernel Methods (.pdf)

## Terminology

The term *kernel* is derived from a word that can be traced back to *c.* 1000 and originally meant a seed (contained within a fruit) or the softer (usually edible) part contained within the hard shell of a nut or stone-fruit. The former meaning is now obsolete. It was first used in mathematics when it was defined for integral equations in which the kernel is known and the other function(s) unknown, but now has several meanings in maths. The machine learning term *kernel trick* was first used in 1998.

## Motivation and Description

Firstly, linearity is rather special, and outside quantum mechanics no model of a real system is truly linear. Secondly, detecting linear relations has been the focus of much research in statistics and machine learning for decades and the resulting algorithms are well understood, well developed and efficient. Naturally, one wants the best of both worlds. So, if a problem is non-linear, instead of trying to fit a non-linear model, one can map the problem from the *input space* to a new (higher-dimensional) space (called the *feature space*) by doing a non-linear transformation using suitably chosen basis functions and then use a linear model in the feature space. This is known as the ‘kernel trick’. The linear model in the feature space corresponds to a non-linear model in the input space. This approach can be used in both classification and regression problems. The choice of kernel function is crucial for the success of all kernel algorithms because the kernel constitutes prior knowledge that is available about a task. Accordingly, there is no free lunch (see No Free Lunch Theorems) in kernel choice.

## History

David Hilbert used the German word *kern* in his first paper on integral equations (Hilbert 1904).

The mathematical result underlying the kernel trick, Mercer’s theorem, is almost a century old (Mercer 1909). It tells us that any ‘reasonable’ kernel function corresponds to *some* feature space.

The underlying mathematical results that allow us to determine which kernels can be used to compute distances in feature spaces was developed by Schoenberg (1938).

The methods for representing kernels in linear spaces were first studied by Kolmogorov (1941) for a countable input domain.

The method for representing kernels in linear spaces for the general case was developed by Aronszajn (1950).

Dunford and Schwartz (1963) showed that Mercer’s theorem also holds true for general compact spaces.

The use of Mercer’s theorem for interpreting kernels as inner products in a feature space was introduced into machine learning by Aizerman, Braverman and Rozonoer (1964).

The justification for a non-linear transformation followed by a linear transformation can be traced back to Cover (1965).

The idea of polynomial kernels stems from Poggio (1975).

Berg, Christensen and Ressel (1984) published a good monograph on the theory of kernels.

Micchelli (1986) discussed closure properties when making kernels.

Saitoh (1988) showed the connection between positivity (a ‘positive matrix’ defined in Aronszajn (1950)) and the positive semi-definiteness of all finite set kernel matrices.

Reproducing kernels were extensively used in machine learning and neural networks by Poggio and Girosi, see for example Poggio and Girosi (1990), a paper on radial basis function networks. The theory of kernels was used in approximation and regularisation theory, and the first chapter of *Spline Models for Observational Data* (Wahba 1990) gives a number of theoretical results on kernel functions.

In a seminal paper, Boser, Guyon and Vapnik (1992) (re)introduced the notion of a kernel into the mainstream of the machine learning literature by combining kernel functions with large margin hyperplanes, leading to support vector machines. They discussed Gaussian and polynomial kernels.

*ANOVA kernels* were first suggested by Burges and Vapnik (1995) (under the name *Gabor kernels*). FitzGerald, Micchelli and Pinkus (1995) studied various notions of multivariate functions which map families of positive semidefinite matrices or of conditionally positive semidefinite matrices into matrices of the same type.

Scholköpf, Smola and Müller (1996) used kernel functions to perform principal component analysis.

For a very readable book covering kernels for strings and trees, see Gusfield (1997). Scholköpf (1997) observed that *any* algorithm which can be formulated solely in terms of dot products can be made non-linear by carrying it out in feature spaces induced by Mercer kernels. Scholköpf, Smola and Müller (1997) presented their paper on kernel PCA.

Smola, Scholköpf and Müller (1998) make the connection between regularization operators and support vector kernels. An early survey of the modern usage of kernel methods in pattern analysis can be found in Burges (1998). For another readable book on kernels for strings and trees, see Durbin, *et al.* (1998). Graepel and Obermayer (1998) proposed a topographic clustering algorithm based on kernel functions. Williams (1998) gives a tutorial on regression with Gaussian processes. Smola and Scholköpf (1998) generalized the support vector approach to a wider range of cost functions, and established a link between regularization operators and support vector kernels. Vapnik (1998) described *recursive ANOVA kernels*. Joachims (1998) proposed *bag-of-word kernels*, which can be considered as an example of kernels between sets. MacKay (1998) gives an introduction to Gaussian processes. Scholköpf, Burges and Smola (1998) edited *Advances in Kernel Methods: Support Vector Learning*, which is a collection of papers submitted during a workshop on SVMs held at the 1997 annual Neural Information Processing Systems (NIPS) conference. Scholköpf, Smola and Müller (1998) published their work on kernel principal component analysis. Stitson, *et al.* (1998) used ANOVA decomposition kernels to good effect in a multi-dimensional support vector regression problem. Wahba (1998) gives a survey of work on reproducing kernel Hilbert spaces.

Watkins (1999) proposed the use of probabilistic context-free grammars to build kernels between sequences. Evgeniou, Pontil and Poggio (1999) present a unified framework for regularization networks and SVMs. Cristianini, Campbell and Shawe-Taylor (1999) presented an algorithm which automatically learns the kernel parameter from the data. Jaakkola and Haussler (1999) proposed using a hidden Markov model to evaluate a kernel between biosequences, where the feature vector is the Fisher score of the distribution; they introduced the *Fisher kernel*. Jaakkola and Haussler (1999) derived a generic class of probabilistic regression models and a parameter estimation technique that can make use of arbitrary kernel functions. Amari and Wu (1999) described a method for modifying a kernel function in order to affect the geometry in the input space, so that the separability of the data is improved. Haussler (1999) proposed a new method of constructing kernels on sets whose elements are discrete structures like strings, trees and graphs. He introduced *P*-kernels. Watkins (1999b) developed a string subsequence kernel by means of recursion.

Cristianini and Shawe-Taylor (2000) published *An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods* which includes a chapter on Kernel-Induced Feature Spaces. Hofmann (2000) used the Fisher kernel for learning the similarity between text documents. Watkins (2000) showed that the scores produced by certain dynamic alignment algorithms for sequences are in fact valid kernel functions. Zien, *et al.* (2000) showed how to incorporate prior biological knowledge by engineering an appropriate kernel function, and developed a *locality-improved kernel*. Oliver, Scholköpf and Smola (2000) provided a regularization-theoretic analysis of a class of SV kernels—called *natural kernels*—based on generative models with density *p*(**x**|θ), such as the Fisher kernel.

Sim (2001) developed a kernel for text categorization which looks only at pairs of words within a certain vicinity with respect to each other. Bartlett and Scholköpf (2001) consider some kernels for structured data. Williamson, Smola and Scholköpf (2001) derive new bounds for the generalization error of kernel machines. Steinwart (2001) considered the influence of the kernel on the consistency of SVMs and developed the concept of *universal kernels*. Herbrich (2001) published the book *Learning Kernel Classifiers: Theory and Algorithms* which provided the first comprehensive overview of both the theory and algorithms of kernel classifiers. Scholköpf and Smola (2001) wrote the book *Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond*, which provides an introduction to SVMs and related kernel methods, and may appeal to mathematicians.

Lodhi, *et al.* (2002) were the first to apply string kernels, with an application to text categorisation. Leslie, Eskin and Noble (2002) introduced a new sequence-similarity (string) kernel, the *spectrum kernel*, for use with SVMs in a discriminative approach to the protein classification problem. Vert (2002b) defined a class of kernels for strings based on simple probabilistic models. In a text classification problem, Joachims (2002) proposed using the vector space model as a kernel. Girolami (2002) applied the kernel trick to clustering in feature space. Gärtner, *et al.* (2002) introduced *multi-instance kernels*, kernels for multi-instance problems—a class of concepts on individuals represented by sets. Kondor and Lafferty (2002) proposed *diffusion kernels*. Tsuda, Kin and Asai (2002) proposed marginalised kernels, which provide a way to design a kernel from latent variables. Vert (2002a) introduced the *tree kernel*. Takimoto and Warmuth (2002a) introduced the *all subsets kernel* as well as the idea of kernels based on paths in a graph. In an application of kernel methods to natural language processing problems, Collins and Duffy (2002) presented the dynamic programming method for comparing two trees in the context of language parsing. Steinwart (2002) showed that SVMs of the 1-norm soft margin type are universally consistent provided that the regularization parameter is chosen in a distinct manner and the kernel belongs to the class of universal kernels.

Jordan (2003) wrote an introduction to probabilistic graphical models. The survey paper by Gärtner (2003) describes several approaches to defining positive definite kernels on structured data. Cortes, Haffner and Mohri (2003a) discuss positive definite rational kernels. Jebara and Kondor (2003) introduced a class of kernels between probability distributions, the *probability product kernel*, which eschews some of the complexities that kernels based on the Kullback-Leibler divergence often contend with. The *Bhattacharyya kernel* and the *expected likelihood kernel* are special cases. Leslie and Kuang (2003) introduced several new families of string kernels—*restricted gappy kernels*, *substitution kernels* and *wildcard kernels*—designed in particular for use with SVMs for classification of protein sequence data. Smola and Kondor (2003) introduced a family of kernels on graphs based on the notion of regularization operators. Cortes, Haffner and Mohri (2003b) introduced a general family of kernels based on weighted transducers or rational relations, *rational kernels*. Borrowing ideas and techniques from information theory and data compression, Cuturi and Vert (2003) proposed a covariance kernel for biological sequences. Kandola, Shawe-Taylor and Cristianini (2003) introduced the *von Neumann kernel*. Leslie, *et al.* (2003) introduced a class of string kernels called *mismatch kernels*. Saunders, Shawe-Taylor and Vinokourov (2003) showed how string kernels can be treated as Fisher kernels. Vishwanathan and Smola (2003) described novel methods of computation using suffix trees. Wolf and Shashua (2003) considered the problem of learning with instances defined over a space of sets of vectors. They derived a positive definite kernel defined over pairs of matrices based on the concept of principal angles between two linear subspaces.

Gert, *et al.* (2004) show how the kernel matrix can be learned from data via semidefinite programming techniques. Kashima, Tsuda and Inokuchi (2004b) introduced a kernel function between two labelled graphs in the framework of marginalized kernels. Leslie, *et al.* (2004) introduced a class of string kernels, called *mismatch kernels*. Pinkus (2004) proved a result concerning conditions in which a function is strictly positive definite. Shawe-Taylor and Cristianini (2004) wrote *Kernel Methods for Pattern Analysis*, which provides practitioners with a large toolkit of algorithms, kernels and solutions ready to be implemented, and also serves as an introduction for students and researchers to kernel-based pattern analysis. The book may appeal to those with a computer science bent. Kashima, Tsuda and Inokuchi (2004a) discussed the construction of kernel functions between labelled graphs and provide a unified account of a family of kernels called *label sequence kernels*. Vert, Saigo and Akutsu (2004) used pair hidden Markov models as kernels. Vishwanathan and Smola (2004) presented a new algorithm suitable for matching discrete objects such as strings and trees in linear time, thus obviating dynamic programming with quadratic time complexity.

Hein, Bousquet and Scholköpf (2005) built a general framework for the generation of maximal margin algorithms for metric spaces. Rasmussen and Williams (2005) provide a systematic and unified treatment of theoretical and practical aspects of Gaussian processes in machine learning.

Borgwardt, *et al.* (2006) proposed a kernel-based statistical test of whether two samples are from the same distribution.

Vishwanathan, Smola and Vidal (2007) proposed a family of kernels based on the Binet-Cauchy theorem, and its extension to Fredholm operators. Their derivation provides a unifying framework for all kernels on dynamical systems currently used in machine learning, including kernels derived from the behavioural framework, diffusion processes, marginalized kernels, kernels on graphs and the kernels on sets arising from the subspace angle approach. Bakir, *et al.* (2007) edited *Predicting Structured Data*.

Filippone, *et al.* (2008) give a a survey of kernel and spectral methods for clustering, and prove that the two paradigms have the same objective. Hofmann, Scholköpf and Smola (2008) review machine learning methods employing positive definite kernels.

## Advantages

- The kernel defines a similarity measure between two data points and thus allows one to incorporate prior knowledge of the problem domain.
- Most importantly, the kernel contains all of the information about the relative positions of the inputs in the feature space and the actual learning algorithm is based only on the kernel function and can thus be carried out without explicit use of the feature space. The training data only enter the algorithm through their entries in the kernel matrix (a Gram matrix), and never through their individual attributes. Because one never explicitly has to evaluate the feature map in the high dimensional feature space, the kernel function represents a computational shortcut.
- The number of operations required is not necessarily proportional to the number of features.

## Links

- Fisher Kernels
- Wikipedia: Kernel
- Wikipedia: Kernel (set theory)
- Wikipedia: Kernel (mathematics)
- Wikipedia: Kernel methods
- Wikipedia: Kernel trick
- Kernel Machines