Cholesky factorization

This article, aimed at a general audience of computational scientists, surveys the Cholesky factorization for symmetric positive definite matrices. Such matrices occur quite frequently in applications, so their special factorization, called the Cholesky factorization, is of practical importance; the papers by Bunch and de Hoog give an entry point to the literature.
Published (last): 25 February 2016
Alexey Frolov, Vadim Voevodin

The Cholesky decomposition algorithm was first proposed by André-Louis Cholesky (October 15, 1875 – August 31, 1918) at the end of the First World War, shortly before he was killed in battle.
He was a French military officer and mathematician. The description of this algorithm was published posthumously in 1924 by his fellow officer Benoit and was later used by Banachiewicz in 1938. In the Russian mathematical literature, the Cholesky decomposition is also known as the square-root method, due to the square-root operations it uses, which Gaussian elimination does not.
Originally, the Cholesky decomposition was used only for dense real symmetric positive definite matrices. At present, the application of this decomposition is much wider. For example, it can also be employed for the case of Hermitian matrices.
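For the Hermitian case just mentioned, a minimal pure-Python sketch can illustrate how the decomposition carries over (the function name and the dense nested-list format are my own choices, not from the article; the factorization becomes A = L·conj(L)ᵀ with a real positive diagonal):

```python
import math

def cholesky_hermitian(a):
    """Sketch: factor a complex Hermitian positive definite matrix A
    as A = L * conj(L).T; the diagonal of L stays real and positive."""
    n = len(a)
    l = [[0j] * n for _ in range(n)]
    for j in range(n):
        # the diagonal pivot of a Hermitian matrix is real
        s = a[j][j].real - sum(abs(l[j][k]) ** 2 for k in range(j))
        l[j][j] = complex(math.sqrt(s))
        for i in range(j + 1, n):
            l[i][j] = (a[i][j] - sum(l[i][k] * l[j][k].conjugate()
                                     for k in range(j))) / l[j][j]
    return l
```

Compared to the real case, the only changes are the conjugation inside the dot products and the use of squared moduli on the diagonal.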
In order to increase computing performance, its block versions are often applied. In the case of sparse matrices, the Cholesky decomposition is also widely used as the main stage of a direct method for solving linear systems.
In order to reduce the memory requirements and the profile of the matrix, special reordering strategies are applied to minimize the number of arithmetic operations. A number of reordering strategies are used to identify independent matrix blocks for parallel computing systems. Various versions of the Cholesky decomposition are successfully used in iterative methods to construct preconditioners for sparse symmetric positive definite matrices. In the case of incomplete triangular decomposition, the elements of the preconditioning matrix are specified only in predetermined positions (for example, in the positions of the nonzero elements of the original matrix); this version is known as the IC0 decomposition.
In order to construct a more accurate decomposition, small elements are filtered out using a filtration threshold. The use of such a threshold allows one to obtain an accurate decomposition, but the number of nonzero elements increases. A decomposition algorithm of second-order accuracy has also been proposed; it retains the number of nonzero elements in the factors of the decomposition while increasing the accuracy.
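The IC0 idea described above can be sketched as follows. This is a hypothetical dense-format illustration (real implementations work on sparse storage): a Cholesky-like sweep that writes an entry of L only where A itself is nonzero and silently drops all fill-in.

```python
import math

def ic0(a):
    """Sketch of IC(0): a Cholesky sweep restricted to the sparsity
    pattern of A; entries outside the pattern are simply discarded."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for j in range(n):
        l[j][j] = math.sqrt(a[j][j] - sum(l[j][k] ** 2 for k in range(j)))
        for i in range(j + 1, n):
            if a[i][j] != 0.0:  # keep only positions where A is nonzero
                l[i][j] = (a[i][j] - sum(l[i][k] * l[j][k]
                                         for k in range(j))) / l[j][j]
    return l
```

For matrices whose exact factor produces no fill-in (for example, tridiagonal ones), IC(0) coincides with the exact Cholesky factor; in general L·Lᵀ only approximates A.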
In its parallel implementation, a special version of an additive preconditioner is applied on the basis of the second-order decomposition. Here we consider the original version of the Cholesky decomposition for dense real symmetric positive definite matrices. For a number of other versions, however, the structure of the decomposition is almost the same (for example, for the complex case). A list of other basic versions of the Cholesky decomposition is available on the page Cholesky method.
These forms of the Cholesky decomposition are equivalent in the amount of arithmetic operations and differ only in data representation. The Cholesky decomposition is widely used due to the following features. The symmetry of the matrix allows one to store slightly more than half of its elements in computer memory and to reduce the number of operations by a factor of two compared to Gaussian elimination.
Note that the LU decomposition does not require the square-root operations and, hence, is somewhat faster than the Cholesky decomposition, but it requires storing the entire matrix.
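A related square-root-free factorization that does keep the symmetry is A = L·D·Lᵀ, with unit lower-triangular L and diagonal D. A minimal pure-Python sketch (names and layout are illustrative, not from the article):

```python
def ldlt(a):
    """Sketch of the square-root-free LDL^T factorization:
    A = L * D * L.T with unit lower-triangular L and diagonal D."""
    n = len(a)
    l = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    d = [0.0] * n
    for j in range(n):
        # diagonal entry of D, corrected by previously computed columns
        d[j] = a[j][j] - sum(l[j][k] ** 2 * d[k] for k in range(j))
        for i in range(j + 1, n):
            l[i][j] = (a[i][j] - sum(l[i][k] * l[j][k] * d[k]
                                     for k in range(j))) / d[j]
    return l, d
```

No square roots appear, at the cost of carrying the separate diagonal factor D.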
The Cholesky decomposition allows one to use the so-called accumulation mode due to the fact that the significant part of computation involves dot product operations. Hence, these dot products can be accumulated in double precision for additional accuracy. In this mode, the Cholesky method has the least equivalent perturbation. During the process of decomposition, no growth of the matrix elements can occur, since the matrix is symmetric and positive definite.
Thus, the Cholesky algorithm is unconditionally stable. Another computational scheme exists, but we do not consider it here, since it has worse parallel characteristics than the one given above; for that scheme, a different operation would have to be considered as the computational kernel instead of the dot product. In BLAS-based implementations, the computational kernel of the Cholesky algorithm thus consists of dot products.
In the accumulation mode, the multiplication and subtraction operations should be performed in double precision (or by using the corresponding function, like the DPROD function in Fortran), which increases the overall computation time of the Cholesky algorithm.
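Python floats are already double precision, so the accumulation mode cannot be reproduced literally; as a stand-in, compensated summation (math.fsum) illustrates the same effect of summing the products of a dot product without losing low-order bits:

```python
import math

def accumulated_dot(x, y):
    # math.fsum tracks exact partial sums, mimicking the effect of
    # accumulating the dot product in higher precision
    return math.fsum(xi * yi for xi, yi in zip(x, y))

x = [1e16, 1.0, -1e16]
ones = [1.0, 1.0, 1.0]
naive = sum(xi * yi for xi, yi in zip(x, ones))   # the 1.0 is lost to rounding
exact = accumulated_dot(x, ones)                  # the 1.0 survives
```

Here the naive running sum rounds 1e16 + 1.0 back to 1e16, so the final result is 0.0, while the accumulated version returns the exact 1.0.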
The graph of the algorithm consists of three groups of vertices positioned at the integer-valued nodes of three domains of different dimension. The first group of vertices belongs to the one-dimensional domain corresponding to the square-root operation.
The coordinates of this domain are as follows. The above graph is illustrated in the figures. In these figures, the vertices of the first group are highlighted in yellow and marked by the letters SQ; the vertices of the second group are highlighted in green and marked by the division sign; the vertices of the third group are highlighted in red and marked by the letter F.
The vertices corresponding to the results of operations (output data) are marked by large circles. Arcs that double one another are depicted as a single arc. The representation of the graph is shown in the figures. Contrary to the serial version, in a parallel version the square-root and division operations require a significant part of the overall computational time.
The existence of isolated square roots on some layers of the parallel form may cause difficulties for particular parallel computing architectures. In the case of symmetric linear systems, the Cholesky decomposition is preferable to Gaussian elimination because of the reduction in computation time by a factor of two.
However, this is not true in the case of its parallel version. In addition, we should mention the fact that the accumulation mode requires multiplications and subtraction in double precision. In a parallel version, this means that almost all intermediate computations should be performed with data given in their double precision format.
Hence, contrary to the serial version, this almost doubles the memory expenditure. Thus, the Cholesky decomposition belongs to the class of algorithms of linear complexity in the sense of the height of its parallel form, whereas its complexity is quadratic in the sense of the width of its parallel form.
Amount of input data: n(n+1)/2 (since the matrix is symmetric, only its lower triangle needs to be stored; in practice, this storage-saving scheme can be implemented in various ways). Amount of output data: n(n+1)/2. In the case of unlimited computer resources, the ratio of the serial complexity to the parallel complexity is quadratic.
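One common way to implement the storage-saving scheme is to pack the lower triangle row by row into a flat array of n(n+1)/2 entries. A small sketch (the row-major packed layout and the helper names are chosen for illustration):

```python
def pack_lower(a):
    """Pack the lower triangle of an n-by-n symmetric matrix into a
    flat list of n*(n+1)//2 entries, row by row."""
    n = len(a)
    return [a[i][j] for i in range(n) for j in range(i + 1)]

def packed_index(i, j):
    """Index of element (i, j), with i >= j, inside the packed array:
    row i starts after the i*(i+1)//2 entries of the previous rows."""
    return i * (i + 1) // 2 + j
```

The same indexing scheme works for both the input matrix and the output factor L, since both occupy only a triangle.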
The computational power of the Cholesky algorithm, considered as the ratio of the number of operations to the amount of input and output data, is only linear. The Cholesky algorithm is almost completely deterministic, which is ensured by the uniqueness theorem for this particular decomposition. A different order of associative operations may lead to the accumulation of round-off errors; however, the effect of this accumulation is not as large as when the accumulation mode is not used in computing the dot products.
The information graph arcs from the vertices corresponding to the square-root and division operations can be considered as groups of data such that the function relating the multiplicity of these vertices and the number of these operations is a linear function of the matrix order and the vertex coordinates.
The most convenient representation is the compact packing of the graph in the form of its projection onto the matrix triangle whose elements are recomputed by the packed operations. The slow growth of matrix elements during the decomposition is due to the fact that the matrix is symmetric and positive definite.
In its simplest version, without permuting the summation, the Cholesky decomposition can be represented in Fortran as a triply nested loop over the lower triangle of the matrix. A block version of the Cholesky algorithm is usually constructed in such a way that the scalar operations of the serial version are replaced by the corresponding block-wise operations, instead of using loop unrolling and reordering techniques.
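The Fortran listing itself did not survive in this copy; an equivalent sketch of the simplest column-wise version, written in Python for compactness (naming is my own):

```python
import math

def cholesky(a):
    """Simplest (column-wise) Cholesky sketch: returns lower-triangular L
    with A = L * L.T for a symmetric positive definite matrix A."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # diagonal element: subtract the squares of the finished row j of L
        l[j][j] = math.sqrt(a[j][j] - sum(l[j][k] ** 2 for k in range(j)))
        for i in range(j + 1, n):
            # below-diagonal elements of column j
            l[i][j] = (a[i][j] - sum(l[i][k] * l[j][k]
                                     for k in range(j))) / l[j][j]
    return l
```

Only the lower triangle of A is ever read, which is what makes the packed storage schemes described earlier possible.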
In order to ensure the locality of memory access in the Cholesky algorithm, in its Fortran implementation the original matrix and its decomposition are stored in the upper triangle instead of the lower triangle.
The efficiency of such a version can be explained by the fact that Fortran stores matrices by columns and, hence, the computer programs in which the inner loops go up or down a column generate serial access to memory, contrary to the non-serial access when the inner loop goes across a row.
This column orientation provides a significant improvement on computers with paging and cache memory. There also exists the following dot-product version of the Cholesky decomposition, which can be illustrated as follows.
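A sketch of this dot-product version, in Python rather than the original Fortran: every element of L is produced by a single dot product over previously computed rows, and math.fsum stands in for the double-precision accumulation that DPROD provides in Fortran (naming is my own):

```python
import math

def cholesky_dot(a):
    """Dot-product version: each element of L comes from one dot product
    of two previously computed rows of L (accumulated with math.fsum)."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            # the computational kernel: a dot product over rows i and j
            s = math.fsum(l[i][k] * l[j][k] for k in range(j))
            if i == j:
                l[i][j] = math.sqrt(a[i][i] - s)
            else:
                l[i][j] = (a[i][j] - s) / l[j][j]
    return l
```

The arithmetic is identical to the column-wise version; only the loop order changes, so that the inner operation is exactly the dot product named above as the computational kernel.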
As can be seen from the above program fragment, the array storing the original matrix and the output data should be declared as double precision for the accumulation mode. Note that the graph of the algorithm for this fragment and for the previous one is almost the same (the only distinction is that the DPROD function is used instead of multiplications). A memory access profile is illustrated in the figure.
Hence, only the elements of this array are referenced in this profile. The above-illustrated implementation consists of a single main stage, which in turn consists of a sequence of similar iterations. An example of such an iteration is highlighted in green. We should also note that, over the iterations, the number of memory accesses increases up to the middle of the algorithm; after that, it decreases toward the end of the algorithm.
This fact allows us to conclude that the data processed by the algorithm are used nonuniformly and that many iterations especially at the beginning of the process use a large amount of data, which decreases the memory access locality. In this case, however, the structure of iterations is the main factor influencing the memory access locality. The first fragment is the serial access to the addresses starting with a certain initial address; each element of the working array is rarely referenced.
This fragment possesses a good spatial locality, since the step in memory between the adjacent memory references is not large; however, its temporal locality is bad, since the data are rarely reused.
The locality of the second fragment is much better, since a large number of references are made to the same data, which ensures a higher degree of both spatial and temporal locality than in the first fragment. We can also estimate the overall locality of these two fragments for each iteration.
However, it is reasonable to consider the structure of each fragment in more detail. In particular, each step of fragment 1 consists of several references to adjacent addresses and the memory access is not serial.
Fragment 2 consists of repetitive iterations; each step of fragment 1 corresponds to a single iteration of fragment 2 (highlighted in green in the figure). This indicates that, in order to understand the local profile structure exactly, it is necessary to consider the profile at the level of individual references. It should be noted that the conclusions here are made on the basis of the figure. The main fragment of the implementation used to obtain the quantitative estimates is given here (the Kernel function).
The startup conditions are discussed here. The first estimate is based on the daps characteristic, which evaluates the number of write and read operations per second. This characteristic is similar to the flops estimate for memory access and is an estimate of memory usage performance rather than an estimate of locality. Nevertheless, the daps characteristic is a good source of information and can be compared with the results obtained according to the cvg characteristic.
The values of this characteristic are given in increasing order. From this figure it follows that the Cholesky algorithm is characterized by a sufficiently high rate of memory usage; however, this rate is lower than that of the LINPACK benchmark or the Jacobi method. The cvg characteristic is used to obtain a more machine-independent estimate of locality and to specify the frequency of fetching data into the cache memory. A smaller value of cvg corresponds to a higher level of locality and to a smaller number of such fetches.
These values are given in decreasing order. From this figure it follows that the Cholesky algorithm occupies a lower position than it has in the performance list given in the previous figure. Nevertheless, a simple parallelization technique causes a large number of data transfers between the processors at each step of the outer loop; this number is almost comparable with the number of arithmetic operations. Hence, it is reasonable to partition the computations into blocks, with the corresponding partitioning of the data arrays, before allocating operations and data among the processors of the computing system in use.
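The block partitioning just described can be sketched as a right-looking blocked Cholesky: factor the diagonal block, solve a triangular system for the panel below it, then update the trailing submatrix. This is a minimal serial Python illustration of the block structure (names and the block size parameter b are my own), not a parallel implementation:

```python
import math

def blocked_cholesky(a, b=2):
    """Right-looking blocked Cholesky sketch with block size b:
    overwrite a copy of A with L block column by block column."""
    n = len(a)
    m = [row[:] for row in a]  # work on a copy
    for s in range(0, n, b):
        e = min(s + b, n)
        # 1) factor the diagonal block A[s:e, s:e] (unblocked Cholesky)
        for j in range(s, e):
            m[j][j] = math.sqrt(m[j][j] - sum(m[j][k] ** 2 for k in range(s, j)))
            for i in range(j + 1, e):
                m[i][j] = (m[i][j] - sum(m[i][k] * m[j][k]
                                         for k in range(s, j))) / m[j][j]
        # 2) triangular solve for the panel below the diagonal block
        for i in range(e, n):
            for j in range(s, e):
                m[i][j] = (m[i][j] - sum(m[i][k] * m[j][k]
                                         for k in range(s, j))) / m[j][j]
        # 3) symmetric rank-b update of the trailing submatrix
        for i in range(e, n):
            for j in range(e, i + 1):
                m[i][j] -= sum(m[i][k] * m[j][k] for k in range(s, e))
    # zero the strictly upper triangle so the result is exactly L
    for i in range(n):
        for j in range(i + 1, n):
            m[i][j] = 0.0
    return m
```

In a parallel setting, steps 1–3 are where the block structure pays off: only whole blocks move between processors, so the data traffic per outer step is far smaller relative to the block-wise arithmetic than in the scalar version.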