PAR Lecture 24, Mon Apr 17
Today we'll see some programming tips and then several parallel computing tools.
1 Software tips
1.1 Freeze decisions early: SW design paradigm
One of my rules is to push design decisions to take effect as early in the process execution as possible. Constructing variables at compile time is best, at function call time is 2nd, and on the heap is worst.

If I have to construct variables on the heap, I construct few and large variables, never many small ones.

Often I compile the max dataset size into the program, which permits constructing the arrays at compile time. Recompiling for a larger dataset is quick (unless you're using CUDA).
Accessing this type of variable uses one less level of pointer than accessing a variable on the heap. I don't know whether this is faster with a good optimizing compiler, but it's probably not slower.

If the dataset has unpredictably sized components, such as a ragged array, then I may do the following.
1. Read the data once to accumulate the necessary statistics.
2. Construct the required ragged array.
3. Reread the data and populate the array.
1.2 Faster graphical access to parallel.ecse
X over ssh is very slow.
Here are some things I've discovered that help.

Use a faster, albeit less secure, cipher:
ssh -c arcfour,blowfish-cbc -C parallel.ecse.rpi.edu
(This does not work yet.)

Use xpra; here's an example:

On parallel.ecse:
xpra start :77; DISPLAY=:77 xeyes&
Don't everyone use 77; pick your own number in the range 20-99.

On server, i.e., your machine:
xpra attach ssh:parallel.ecse.rpi.edu:77

I suspect that security is weak. When you start an xpra session, I suspect that anyone on parallel.ecse can display to it. I suspect that anyone with ssh access to parallel.ecse can try to attach to it, and that the 1st person wins.


Use nx, which needs a server, e.g., FreeNX.
2 Jack Dongarra videos

Sunway TaihuLight's strengths and weaknesses highlighted. 9 min. 8/21/2016.
This is the new fastest known machine on the Top500 list. A machine with many Intel Xeon Phi coprocessors is now 2nd, an Nvidia K20-based machine is 3rd, and some machine built by a company down the river is 4th. These last 3 machines have been at the top for a surprisingly long time.

An Overview of High Performance Computing and Challenges for the Future. 57min, 11/16/2016.
We saw the 1st 18 min in class.
3 More parallel tools
3.1 cuFFT Notes
 GPU Computing with CUDA, Lecture 8 - CUDA Libraries - CUFFT, PyCUDA, from Christopher Cooper, BU
 video #8 - CUDA 5.5: cuFFT FFTW API Support. 3 min.
 cuFFT is inspired by FFTW (the Fastest Fourier Transform in the West), whose authors say it's as fast as commercial FFT packages.
 I.e., sometimes commercial packages may be worth the money.
 Although the FFT is taught for N a power of two, users often want to process other dataset sizes.
 The problem is that the optimal recursion method, and the relevant coefficients, depend on the prime factors of N.
 FFTW and cuFFT determine the good solution procedure for the particular N.
 Since this computation takes time, they store the method in a plan.
 You can then apply the plan to many datasets.
 If you're going to be processing very many datasets, you can tell FFTW or cuFFT to perform sample timing experiments on your system, to help in devising the best plan.
 That's a nice strategy that some other numerical SW uses.
 One example is Automatically Tuned Linear Algebra Software (ATLAS).
3.2 cuBLAS etc Notes
 BLAS is an API for a set of simple matrix and vector functions, such as multiplying a vector by a matrix.
 These functions' efficiency is important since they are the basis for widely used numerical applications.
 Indeed, you usually don't call BLAS functions directly, but use higher-level packages like LAPACK that call BLAS.
 There are many implementations, free and commercial, of BLAS.
 cuBLAS is one.
 One reason that Fortran is still used is that, in the past, it was easier to write efficient Fortran programs than C or C++ programs for these applications.
 There are other, very efficient, C++ numerical packages. (I can list some, if there's interest).
 Their efficiency often comes from aggressively using C++ templates.
 Matrix mult example
3.3 Matlab

Good for applications that look like matrices.
Considerable contortions are required for, e.g., a general graph, which you'd represent with a large sparse adjacency matrix.

Using explicit for loops is slow.

Efficient execution when using built-in matrix functions,
but it can be difficult to write your algorithm that way, and
difficult to read the resulting code.

Very expensive and getting more so.
Many separately priced apps.

Uses state-of-the-art numerical algorithms.
E.g., to solve large sparse overdetermined linear systems.
Better than Mathematica.

Most or all such algorithms also freely available as C++ libraries.
However, which library to use?
Complicated calling sequences.
Obscure C++ template error messages.

Graphical output is mediocre.
Mathematica is better.

Various ways Matlab can execute in parallel

Operations on arrays can execute in parallel.
E.g. B=SIN(A) where A is a matrix.

Automatic multithreading by some functions
Various functions, like INV(a), automatically use perhaps 8 cores.
The '8' is a license limitation.
Which MATLAB functions benefit from multithreaded computation?

PARFOR
Like FOR, but multithreaded.
However, FOR is slow.
Many restrictions, e.g., cannot be nested.
Matlab's introduction to parallel solutions
Start pools first with: MATLABPOOL OPEN 12
Limited to 12 threads.
Can do reductions.

Parallel Computing Server
This runs on a parallel machine, including Amazon EC2.
Your client sends batch or interactive jobs to it.
Many Matlab toolboxes are not licensed to use it.
This makes it much less useful.

GPU computing
Create an array on the device with gpuArray.
Run built-in functions on it.
Matlab documentation: run built-in functions on a GPU

3.4 Mathematica in parallel
You terminate an input command with shift-enter.
Some Mathematica commands:
Sin[1.]
Plot[Sin[x],{x,-2,2}]
a=Import["/opt/parallel/mathematica/mtn1.dat"]
Information[a]
Length[a]
b=ArrayReshape[a,{400,400}]
MatrixPlot[b]
ReliefPlot[b]
ReliefPlot[b,Method->"AspectBasedShading"]
ReliefPlot[MedianFilter[b,1]]
Dimensions[b]
Eigenvalues[b]   (* when you get bored waiting, type alt-. *)
Eigenvalues[b+0.0]
Table[{x^i y^j,x^j y^i},{i,2},{j,2}]
Flatten[Table[{x^i y^j,x^j y^i},{i,2},{j,2}],1]
StreamPlot[{x*y,x+y},{x,-3,3},{y,-3,3}]
$ProcessorCount
$ProcessorType
(* select "Parallel Kernel Configuration" and "Status" in the Evaluation menu *)
ParallelEvaluate[$ProcessID]
PrimeQ[101]
Parallelize[Table[PrimeQ[n!+1],{n,400,500}]]
merQ[n_]:=PrimeQ[2^n-1]
Select[Range[5000],merQ]
ParallelSum[Sin[x+0.],{x,0,100000000}]
Parallelize[Select[Range[5000],merQ]]
Needs["CUDALink`"]   (* note the back quote *)
CUDAInformation[]
Manipulate[n, {n, 1.1, 20.}]
Plot[Sin[x], {x, 1., 20.}]
Manipulate[Plot[Sin[x], {x, 1., n}], {n, 1.1, 20.}]
Integrate[Sin[x]^3, x]
Manipulate[Integrate[Sin[x]^n, x], {n, 0, 20}]
Manipulate[{n, FactorInteger[n]}, {n, 1, 100, 1}]
Manipulate[Plot[Sin[a x] + Sin[b x], {x, 0, 10}], {a, 1, 4}, {b, 1, 4}]
Unfortunately there's a problem that I'm still debugging with the Mathematica-CUDA interface.