Algorithm storage and computation


There are 4 things to keep in mind when choosing or designing an algorithm for matrix computations:  

  1. Memory Use
  2. Speed
  3. Accuracy
  4. Scalability/Parallelization

Often there will be trade-offs between these categories.  

Motivation: Matrices are everywhere -- anything that can be put in an Excel spreadsheet is a matrix, and language and pictures can be represented as matrices as well. Knowing what options there are for matrix algorithms, and how to navigate the compromises, can make an enormous difference to our solutions. For instance, an approximate matrix computation can often be thousands of times faster than an exact one. Knowing how the algorithms really work helps us both debug and accelerate our solutions.

Matrix Computations: There are two key types of matrix computation, which get combined in many different ways:

  1. Matrix and tensor products
  2. Matrix decompositions

So basically we're going to be combining matrices, and pulling them apart again!
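Here is a minimal NumPy sketch (my own, not from the original post) showing one of each: a matrix product to combine matrices, and a QR decomposition to pull one apart.

```python
import numpy as np

A = np.random.rand(4, 3)
B = np.random.rand(3, 5)

# Combining matrices: a matrix (and, more generally, tensor) product
C = A @ B                      # shape (4, 5)

# Pulling a matrix apart: a decomposition, here QR
Q, R = np.linalg.qr(A)
print(np.allclose(A, Q @ R))   # True, up to floating point error
```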

 “Math is continuous & infinite, but computers are discrete & finite.”  

Two limitations of computer representations of numbers:

  1. they can't be arbitrarily large or small
  2. there must be gaps between them

The reason we need to care about accuracy is that computers can't store infinitely accurate numbers. It's possible to create calculations that give very wrong answers, particularly when repeating an operation many times, since each operation can compound the error.
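As a quick illustration (my own example, not from the post), ordinary 64-bit floats show all of these issues: values that are too large overflow to inf, values too close to an existing float vanish into the gap, and repeated operations accumulate error.

```python
import numpy as np

print(np.float64(1e308) * 10)     # inf: floats can't be arbitrarily large (numpy emits an overflow warning)
print(1.0 + 1e-17 == 1.0)         # True: 1e-17 falls inside the gap just above 1.0

# Repeating an operation compounds the representation error:
total = sum([0.1] * 10)
print(total, total == 1.0)        # 0.9999999999999999 False
```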

Now, how computers store numbers:

IEEE double precision arithmetic: numbers can be as large as 1.79 × 10^308 and as small as 2.23 × 10^−308.

The interval [1, 2] is represented by the discrete subset: 1, 1 + 2^−52, 1 + 2×2^−52, 1 + 3×2^−52, …, 2

Machine Epsilon: half the distance between 1 and the next larger number. This can vary by computer. The IEEE standard for double precision specifies

ε_machine = 2^−53 ≈ 1.11 × 10^−16

Two important properties of Floating Point Arithmetic:

  1. The difference between a real number x and its closest floating point approximation fl(x) is always smaller than ε_machine in relative terms: for some ε, where |ε| ≤ ε_machine, fl(x) = x·(1+ε).
  2. Where ∗ is any operation (+, −, ×, ÷) and ⊛ is its floating point analogue, x ⊛ y = (x ∗ y)(1+ε) for some ε, where |ε| ≤ ε_machine. That is, every operation of floating point arithmetic is exact up to a relative error of size at most ε_machine.
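A small sketch (my own, using NumPy and the standard library's exact Fraction type) checks the first property for x = 0.1. Note that np.finfo reports the gap above 1.0, i.e. 2^−52, which is twice the ε_machine defined above.

```python
from fractions import Fraction
import numpy as np

eps_machine = 2.0 ** -53                  # machine epsilon as defined above
print(np.finfo(np.float64).eps)           # 2.220446049250313e-16 == 2**-52, the gap above 1.0

fl_x = Fraction(0.1)                      # the exact rational value actually stored for 0.1
true_x = Fraction(1, 10)                  # the real number we asked for
rel_err = abs(fl_x - true_x) / true_x
print(float(rel_err))                     # ~5.55e-17
print(float(rel_err) <= eps_machine)      # True: fl(x) = x(1+ε) with |ε| <= ε_machine
```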

Since we cannot represent numbers exactly on a computer (due to the finiteness of our storage, and the gaps between numbers in floating point architecture), it becomes important to know how small perturbations in the input to a problem impact the output.  

 "A stable algorithm gives nearly the right answer to nearly the right question." --Trefethen   

Conditioning: perturbation behavior of a mathematical problem (e.g. least squares)  

Stability: perturbation behavior of an algorithm used to solve that problem on a computer (e.g. least squares algorithms, Householder, back substitution, Gaussian elimination)   
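A standard illustration of the difference (my example, not from the post): evaluating (1 − cos x)/x² near zero is a well-conditioned problem, but the obvious algorithm is unstable because of catastrophic cancellation, while an algebraically equivalent rewrite is stable.

```python
import numpy as np

x = 1e-8                                   # the true value of (1 - cos x) / x**2 here is ~0.5

naive = (1 - np.cos(x)) / x**2             # unstable: cos(x) rounds to exactly 1.0, so this gives 0.0
stable = 2 * np.sin(x / 2) ** 2 / x**2     # same math, via 1 - cos(x) = 2*sin(x/2)**2

print(naive, stable)                       # 0.0 vs roughly 0.5
```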

Expensive Errors: The example below is from Greenbaum & Chartier. The European Space Agency spent 10 years and $7 billion on the Ariane 5 rocket, which was destroyed by what can happen when you try to fit a 64-bit number into a 16-bit space: integer overflow.
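As a rough, hypothetical reproduction of that failure mode (not the actual Ariane code), here is what happens when a 64-bit floating point value that is too large for a signed 16-bit integer is truncated into one:

```python
import ctypes

reading = 7.0e5                            # a hypothetical 64-bit float, far above the int16 maximum of 32767
wrapped = ctypes.c_int16(int(reading)).value   # ctypes does no overflow checking; the value wraps
print(wrapped)                             # -20896: a meaningless wrapped value, not 700000
```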

Sparse vs Dense: Now that we know how numbers are stored, let's talk about how matrices are stored. A key way to save memory (and computation) is not to store all of your matrix, but only the non-zero elements. This is called sparse storage, and it is well suited to sparse matrices, that is, matrices where most elements are zero.
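A minimal sketch, assuming SciPy is available (the post itself only names numpy/BLAS/LAPACK), of storing a mostly-zero matrix in compressed sparse row form:

```python
import numpy as np
from scipy import sparse

dense = np.array([[0., 0., 3.],
                  [4., 0., 0.],
                  [0., 0., 0.]])

A = sparse.csr_matrix(dense)     # store only the non-zero entries plus their positions
print(A.nnz)                     # 2 stored values instead of 9
print(A.data, A.indices)         # [3. 4.] [2 0]
print(A @ np.ones(3))            # [3. 4. 0.]: operations work directly on the sparse form
```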

Speed differences come from a number of areas, particularly:  

  1. Computational complexity
  2. Vectorization
  3. Scaling to multiple cores and nodes
  4. Locality

Computational complexity: Algorithms are generally expressed in terms of computational complexity with respect to the number of rows and the number of columns in the matrix. E.g. you may find an algorithm described as O(n²m) (big O notation).   
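For example (my own sketch), multiplying an n×m matrix by an m×n matrix costs O(n²m) operations, so doubling n should roughly quadruple the runtime; exact timings will vary by machine.

```python
import time
import numpy as np

m = 200
for n in (500, 1000, 2000):
    A = np.random.rand(n, m)
    B = np.random.rand(m, n)
    t0 = time.perf_counter()
    _ = A @ B                                  # O(n^2 * m) floating point operations
    print(n, f"{time.perf_counter() - t0:.4f} s")
```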

Vectorization: Modern CPUs and GPUs can apply an operation to multiple elements at once on a single core; for instance, taking the exponent of 4 floats in a vector in a single step. This is called SIMD. We will not be explicitly writing SIMD code (which tends to require assembly language or special C "intrinsics"), but instead will use vectorized operations in libraries like numpy, which in turn rely on specially tuned, vectorized low-level linear algebra APIs (in particular BLAS and LAPACK).   
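A quick comparison (mine, not from the post) of a vectorized numpy call against an explicit Python loop doing the same work; the exact speedup depends on the machine, but it is typically large.

```python
import time
import numpy as np

x = np.random.rand(100_000)

t0 = time.perf_counter()
y_vec = np.exp(x)                          # one call into numpy's compiled, SIMD-aware loop
t_vec = time.perf_counter() - t0

t0 = time.perf_counter()
y_loop = np.array([np.exp(v) for v in x])  # one interpreted operation per element
t_loop = time.perf_counter() - t0

print(np.allclose(y_vec, y_loop), t_vec, t_loop)
```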

Locality: Using slower ways to access data (e.g. over the internet) can be up to a billion times slower than faster ways (e.g. from a register). But there is much less fast storage than slow storage, so once we have data in fast storage we want to do all the computation required at that time, rather than loading it again each time we need it. In addition, for most types of storage it's much faster to access data items that are stored next to each other, so we should try to use data stored nearby whenever we can. These two issues are known as locality.   
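A small demonstration of locality (my own sketch): copying a C-ordered array reads memory sequentially, while making a contiguous copy of its transpose has to jump between distant cache lines. The second is usually noticeably slower even though both move the same number of bytes.

```python
import time
import numpy as np

a = np.random.rand(4000, 4000)             # C-ordered: row elements sit next to each other in memory

t0 = time.perf_counter()
_ = a.copy()                               # sequential, cache-friendly reads
t_contig = time.perf_counter() - t0

t0 = time.perf_counter()
_ = np.ascontiguousarray(a.T)              # strided reads: consecutive elements live a full row apart
t_strided = time.perf_counter() - t0

print(f"contiguous copy: {t_contig:.3f} s, strided copy: {t_strided:.3f} s")
```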

Speed of different types of memory: Here are some numbers everyone should know (from the legendary Jeff Dean):

L1 cache reference: 0.5 ns
L2 cache reference: 7 ns
Main memory reference (RAM): 100 ns
Send 2K bytes over 1 Gbps network: 20,000 ns
Read 1 MB sequentially from memory: 250,000 ns
Round trip within same datacenter: 500,000 ns
Disk seek: 10,000,000 ns
Read 1 MB sequentially from network: 10,000,000 ns
Read 1 MB sequentially from disk: 30,000,000 ns
Send packet CA -> Netherlands -> CA: 150,000,000 ns

Key take-away: Each successive memory type is (at least) an order of magnitude worse than the one before it. Disk seeks are very slow.    
