Core implementation: Difference between revisions

From openpipeflow.org
Jump to navigation Jump to search
Line 345: Line 345:
== Modifications for the parallel implementation ==
== Modifications for the parallel implementation ==


The code has been parallelised using the Message Passing Interface (MPI) in a way designed to be as unintrusive as possible. Sections of the code with special MPI function calls are separated from the serial sections by preprocessor directives. The number of processors is given by the parameter <tt>_Np=_Nr*_Ns</tt> set in <tt>parallel.h</tt>. The rank of the local processor is given by <tt>mpi_rnk</tt> declared in <tt>mpi.f90</tt>, and <tt>mpi_sze</tt>$=$<tt>_Np</tt>.
The code has been parallelised using the Message Passing Interface (MPI) in a way designed to be as unintrusive as possible. Sections of the code with special MPI function calls are separated from the serial sections by preprocessor directives. The number of processors is given by the parameter <tt>_Np=_Nr*_Ns</tt> set in <tt>parallel.h</tt>. The rank of a process is given by <tt>mpi_rnk</tt> declared in <tt>mpi.f90</tt>, and <tt>mpi_sze</tt>$=$<tt>_Np</tt>.


For modest parallelisations, it is recommended to vary <tt>_Nr</tt> and to keep <tt>_Ns=1</tt> (radial split only).  This ensures a couple of conditions are met that could easily be forgotten when setting up a run: <tt>_Ns</tt> must divide both <tt>i_Z</tt> and <tt>i_M</tt>.
For modest parallelisations, it is recommended to vary <tt>_Nr</tt> and to keep <tt>_Ns=1</tt> (radial split only).  This ensures a couple of conditions are met that could easily be forgotten when setting up a run: <tt>_Ns</tt> must divide both <tt>i_Z</tt> and <tt>i_M</tt>.

Revision as of 02:02, 21 July 2015

$ \renewcommand{\vec}[1]{ {\bf #1} } \newcommand{\bnabla}{ \vec{\nabla} } \newcommand{\Rey}{Re} \def\vechat#1{ \hat{ \vec{#1} } } \def\mat#1{#1} $

Spatial representation

Finite difference stencil — $r$

A function $p(r)$ is represented by the values $p_n=p(r_n)$, where $p(r)$ is evaluated on the $N$ radial points $r_n$; $n=1..N$. There is no point on the axis $r=0$ to avoid singularities in $1/r$ terms. The points are located at the roots of a Tchebyshev polynomial, such that they are clustered near the wall, and slightly so near the axis to compensate for loss of order in the calculation of derivatives. (They are calculated using banded matrices, implying use of fewer points at the axis and boundary).

Differentiation

Taylor expansions about central point $x_0$, $k$ neighbouring points each side

$\label{eq:taylorexp} \begin{array}{lclrrrr}

  f(x_{-k}) &=& f(x_0)
  &+& (x_{-k}-x_0)\,f'(x_0) &+& \frac{(x_{-k}-x_0)^2}{2!}\,f(x_0)+\dots
  \\ & \vdots & \\
  f(x_k) &=& f(x_0)
  &+& (x_k-x_0)\,f'(x_0) &+& \frac{(x_k-x_0)^2}{2!}\,f(x_0)+\dots

\end{array}$

The expansions \ref{eq:taylorexp} can be written

$\vec{f}=\mat{A}\,\vec{df}, \quad

  \vec{df}=\mat{A}^{-1}\,\vec{f}, \quad
  \vec{f}=\left[\begin{array}{c}f(x_{-k})\\
          \vdots\\ f(x_k)\end{array}\right], \quad
  \vec{df}=\left[\begin{array}{c}
          f(x_0)\\ f'(x_0)\\ f(x_0)\\ \vdots \end{array}\right]$

Derivatives are calculated using weights from the appropriate row of $\mat{A}^{-1}$.

Integration

Integrating \ref{eq:taylorexp} the indefinite integral may be approximated about $x_0$

$\begin{eqnarray*}

   \int f(x) \, {\mathrm d}x & = &
   \begin{array}{c}(x-x_0)\,f(x_0) + \frac{(x-x_0)^2}{2!}\,f'(x_0) +
   \frac{(x-x_0)^3}{3!}\,f(x_0) + \dots \end{array}
   \\
   & = &
   [\begin{array}{lll}(x-x_0) & \frac{(x-x_0)^2}{2!} &
   \dots \end{array}] \,\, \mat{A}^{-1}\vec{f}\end{eqnarray*}$

The total integral is approximated by

$ \frac1{2} \sum_n \int_{x_{n-1}}^{x_{n+1}} f(x) \, {\mathrm d}x$

Matrix representation of linear operators

Linear equations can be written in the matrix form

$\mat{L} \,\vec{p} = \mat{M}\, \vec{q} + \vec{s}.$

$\mat{L}$ and $\mat{M}$ are $N\times N$ matrices representing linear operators. If $\vec{q}$ and $\vec{s}$ are known then the right-hand side is simple to evaluate. Consider the equation

$\label{eq:matmod}

  \mat{L}\, \vec{p} = \vec{q},$

where $\vec{p}$ is unknown. If the matrix operators are formed by linear combinations of the weights in #Differentiation, using only $k$ neighbouring points to each side, they are banded. Boundary conditions are formed in the same way and are placed in the end rows. The equation is solved by forward-backward substitution, using the banded LU-factorisation of $\mat{L}$.

figure=LU.eps, scale=0.65

Fourier representaion — $\theta$, $z$

Expansion of variables

$A(\theta,z) =

     \sum_{km} A_{km} {\mathrm e}^{{\mathrm i} (\alpha kz + m_0 m\theta)}$

$A$ real implies $A_{km} = A_{-k,-m}^*$ (coefficients are conjugate-symmetric). It is sufficient to keep only $m\ge 0$, and for $m=0$ keep $k\ge 0$.

Implied conditions on the axis

The geometry enforces conditions on each variable at the axis when expanded over Fourier modes in $\theta$. For scalars and the $z$-component of a vector $A_z$, each mode is even in $r$ if $m$ is even, and is odd if $m$ is odd. Each mode for $A_r$ and $A_\theta$ is even in $r$ if $m$ is odd, and is odd if $m$ is even. This implies that either the function or its derivative is zero on the axis.

Orthogonality

Exponentials

$\int_0^{2\pi} {\mathrm e}^{{\mathrm i}(m+n)} \, {\mathrm d}\theta

     = 2\pi \, \delta_{m,-n}\, ,
     \qquad
     \int_0^{\frac{2\pi}{\alpha}} {\mathrm e}^{{\mathrm i}\alpha(k+j)} \, {\mathrm d}z 
     = \frac{2\pi}{\alpha} \, \delta_{k,-j}$

Energy integral

$E \, = \,

     {\textstyle \frac1{2}} \int \vec{A} \cdot \vec{A} \, {\mathrm d}V 
     \, = \,
     \frac{2\pi^2}{\alpha} \, \sum_{km}
     \int |\vec{A}_{km}|^2 \, r\,{\mathrm d}r$

$E = \sum_{m=0}^{\infty} E_m,

  \qquad
     E_m =
     \left\{
        \begin{array}{ll}
           \displaystyle
           \frac{2\pi^2}{\alpha}\int\vec{A}_{00}^2 \, r \, {\mathrm d}r
           + \frac{4\pi^2}{\alpha} \sum_{k>0}
           \int|\vec{A}_{k0}^2| \, r \, {\mathrm d}r,
           & m=0, \\[15pt]
           \displaystyle
           \frac{4\pi^2}{\alpha} \sum_k 
           \int |\vec{A}_{km}|^2 \, r \, {\mathrm d}r,
           & m>0 .
        \end{array}
     \right.$

Volume integral

$E \, = \,

     {\textstyle \frac1{2}} \int A \, {\mathrm d}V 
     \, = \,
     \frac{4\pi^2}{\alpha} \, \int A_{00} \, r\,{\mathrm d}r$


Temporal discretisation

PPE-formulation with correct boundary conditions

Write the time-discretised Navier–Stokes equations in the form

$\label{eq:NSdisc}

        \left\{\begin{array}{rcl}
    \mat{X}\, \vec{u}^{q+1} & = & \mat{Y}\, \vec{u}^q 
                + \vec{N}^{q+\frac{1}{2}} - \bnabla p \, , \\
                \nabla^2 p & = & \bnabla\cdot(\mat{Y}\, \vec{u}^q 
                + \vec{N}^{q+\frac{1}{2}}) ,
        \end{array}\right.$

where $q$ denotes time $t_q$, which is sixth order in $r$ for $\vec{u}^{q+1}$ and second order for $p$, where the solenoidal condition is not explicitly imposed. Symmetry conditions provide the conditions at the axis. The difficulty is in imposing the remaining four — this system should be inverted, in principle, simultaneously for $p$ and $\vec{u}^{q+1}$ with boundary conditions $\vec{u}^{q+1}=\vec{0}$ and $\bnabla\cdot\vec{u}^{q+1}=0$ on $r=R$ (Rempfer 2006). In practice it would be preferable to invert for $p$ first then for $\vec{u}^{q+1}$, but the boundary conditions to not involve $p$ directly.

Note that the $\mat{Y}\,\vec{u}^q$ term has been included in the right-hand side of the pressure-Poisson equation, the divergence of which should be small. Assume that pressure boundary condition is known: the right-hand side of the Navier–Stokes equation is then projected onto the space of solenoidal functions though $p$ and hence after inversion, $\vec{u}^{q+1}$ will be solenoidal.

Consider the ‘bulk’ solution, $\{\bar{\vec{u}},\bar{p}\}$, obtained from solution of the following:

$\label{eq:NSbulk}

        \left\{\begin{array}{rcl}
    \mat{X}\, \bar{\vec{u}} & = & \mat{Y}\, \vec{u}^q 
                + \vec{N}^{q+\frac{1}{2}} - \bnabla \bar{p} \, , \\
                \nabla^2 \bar{p} & = & \bnabla\cdot(\mat{Y}\, \vec{u}^q 
                + \vec{N}^{q+\frac{1}{2}}) ,
        \end{array}\right.$

with boundary conditions $\bar{\vec{u}}=\vec{0}$ and $\partial_{r}\bar{p}=0$. Introduce the following systems:

$\label{eq:NSp0}

        \left\{\begin{array}{rcl}
         \vec{u}' & = & -\bnabla p' \, , \\
                \nabla^2 p' & = & 0,
        \end{array}\right.$

with boundary condition $\partial_{r}p'=1$ on $r=R$, and

$\label{eq:NSu0}

        \left\{\begin{array}{rcl}
    \mat{X}\, \vec{u}' & = & \vec{0},
        \end{array}\right.$

with boundary conditions $u'_+=1$, $u'_-=1$, $u'_z=\mathrm{i}$ on $r=R$ (see Decoupling_the_equations). The system \ref{eq:NSp0} provides a linearly independent function $\vec{u}'_j$ that may be added to $\bar{\vec{u}}$ without affecting the right-hand side in \ref{eq:NSbulk}, but altering (to correct) the boundary condition applied. (Note that $\bnabla^2(\bnabla p')=\bnabla(\nabla^2 p')-\bnabla \wedge\bnabla \wedge\bnabla p'=0$.) Similarly the system \ref{eq:NSu0} provides a further three functions. The superposition

$\label{eq:usuperpos}

  \vec{u}^{q+1} = \bar{\vec{u}} + \sum_{j=1}^4 a_j\, \vec{u}'_j \,$

may be formed in order to satisfy the four boundary conditions, $\vec{u}^{q+1}=\vec{0}$ and $\bnabla\cdot\vec{u}^{q+1}=0$ on $r=R$. Substituting \ref{eq:usuperpos} into the boundary conditions, they may be written

$\label{eq:velBCs}

  \mat{A}\,\vec{a} = -\vec{g}(\bar{\vec{u}}) ,$

where $\mat{A}=\mat{A}(\vec{g}(\vec{u}'))$ is a 4$\times$4 matrix. The appropriate coefficients required to satisfy the boundary conditions are thereby recovered from solution of this small system for $\vec{a}$. The error in the boundary conditions $g_j(\vec{u}^{q+1})$ using the influence-matrix technique is at the level of the machine epsilon, typically order 1e-18.

The functions $u'_j(r)$, the matrix $\mat{A}$ and its inverse may all be precomputed. The boundary conditions for $\vec{u}'$ have been chosen so that that $u'_\pm$ are pure real, $u'_z$ is pure imaginary, and $\mat{A}$ is real. For each timestep, this application of the influence matrix technique requires only evaluation of the deviation from the boundary condition, multiplication by a 4$\times$4 real matrix, and the addition of only two functions to each component of $\vec{u}$, each either pure real or pure imaginary. Compared to the evaluation of nonlinear terms, the computational overhead is negligible.

Predictor-corrector

The model equation for each Fourier mode is

$\label{eq:harmmod}

  (\partial_{t} - \nabla^2) f = N,$

where nonlinear terms have been evaluated on each radial point by the transform method, and is solved as in #Matrix_representation_of_linear_operators. The predictor at time $t_q$, with Euler nonlinear terms and implicitness $c$

$\frac{f_1^{q+1}-f^q}{\Delta t}

  - \left( c\nabla^2 f_1^{q+1} + (1-c)\nabla^2 f^q \right)
  = N^q ,$

$\left( \frac1{\Delta t} - c\nabla^2 \right) f_1^{q+1}

  = \left( \frac1{\Delta t} + (1-c)\nabla^2 \right) f^q + N^q .$

Corrector iterations are

$\left( \frac1{\Delta t} - c\nabla^2 \right) f_{j+1}^{q+1}

  = \left( \frac1{\Delta t} + (1-c)\nabla^2 \right) f^q
  + c\,N_j^{q+1} + (1-c)\, N^q ,$

or equivalently, for the correction $f_{corr} = f_{j+1}^{q+1} - f_j^{q+1}$

$\left( \frac1{\Delta t} - c\nabla^2 \right) f_{corr}

  = c\,N_j^{q+1} - c\, N_{j-1}^{q+1} ,$

where $j=1,2,\dots$ and $N_0^{q+1}=N^q$. The size of the correction $\|f_{corr}\|$ must reduce at each iteration. For $c=\frac1{2}$ the scheme is second order such that $\|f_{corr}\| \sim \Delta t^2$.

Timestep control

$\Delta t = \mathrm{C} \, \min(\Delta \, / \, |\vec{v}| ) ,

  \qquad 
  0<\mathrm{C}<1,$

where $\mathrm{C}$ is the Courant number.

The timestep should also be small enough such that the corrector norm $\|f_{corr}\|$ is satisfactorily small. This may be important for integrating initial transients, but ought not to be the limiting factor in general.

The code structure

Naming conventions

Parameters

Parameters defined within parameters.f90 are given a prefix indicating the data type. For example:

$N$ i_N integer
$\Delta t$ d_timestep double
fixed flux? b_fixed_flux boolean

Note that the naming convention frees the variable name for dummy variables, e.g.:

   integer :: n
   do n = 1, i_N
      ...

Public variables

By default variables declared within a module are accessible to any other module or subroutine that uses the module. To ensure that it is clear where a public variable is declared, variables are given a prefix. Examples:

$\Delta t$ tim_dt timestep.f90
$u_r$ vel_r velocity.f90

Ordering the Fourier modes

$A = \sum_{km} A_{km} {\mathrm e}^{{\mathrm i}(\alpha kz + m_0 m\theta)},$

Coefficients are stored as in the following loop, where rather than two indices k and m, the single index nh labels the modes:

   nh = -1
   do m = 0, M-1
      m_ = m*Mp
      do k = -(K-1), (K-1)
         if(m==0 .and. k<0) cycle
         nh = nh+1
         ...

NOTES:

  • Loops for $m<0$ are skipped. Values of coefficients for $m<0$ are inferred from the conjugate symmetric property, $A_{km}=A_{-k,-m}^*$.
  • This loop, and its parallel equivalent, are predefined macros in parallel.h. The macro _loop_km_begin replaces the double-loop above.

Data types

The data types below consist of logical data groups to facilitate the passing of data between functions. For clarity, in the following text they are defined as though we were working on a single processor. See Parallel regarding subtle differences in the definitions due to parallelisation.

coll, spec — storage of coefficients

The collocated data type (coll) is the principle type used for mode-independent operations.

   type coll
      double precision     :: Re(i_N, 0:i_H1)
      double precision     :: Im(i_N, 0:i_H1)
   end type coll

The value Re(n,nh) is the real part of the coefficient for the nh$^{\mathrm{th}}$ harmonic at $r_n$.

The spectral type (spec) is a data type for operations independent at each radial point. It is rarely used other than as a transitory data format between the coll and phys types.

   type spec
      double precision     :: Re(0:i_H1, i_N)
      double precision     :: Im(0:i_H1, i_N)
   end type spec

Conversions between the coll and spec types involve only a transpose, var_coll2spec() and var_spec2coll().

phys — real space data

The type (phys) is used for data in real space.

   type phys
      double precision     :: Re(0:i_Z-1, 0:i_Th-1, i_N)
   end type phys

The element Re(k,m,n) refers to the value at $z_k$, $\theta_m$ and $r_n$.

rdom — the radial domain

The type (rdom) contains information the radial domain. The public variable mes_D is declared in meshs.f90. Data includes the number of radial points mes_D%N, powers of $r$, $r_n^p$=mes_D%r(n,p), weights for integration $\int_0^1 f(r)\,r\,{\mathrm d}r$=dot_product(f(1:i_N),mes_D%intrdr), matrices for taking the $p^{\mathrm{th}}$ derivative and more:

   type rdom
      integer          :: N
      double precision :: r(i_N,i_rpowmin:i_rpowmax)
      double precision :: intrdr(i_N)
      type (mesh)      :: dr(i_KL)
      ...
   end type rdom

mesh, lumesh — storage of banded matrices

The finite-difference stencil has finite width (Finite_Differences) and so matrix operations involve matrices that are banded. The type (mesh), defined in meshs.f90, is used for matrices on the radial mesh involving $kl$ points to either side of each node.

   type mesh
      double precision :: M(2*i_KL+1, i_N)
   end type mesh

For a matrix A, the element M(KL+1+n-j, j) = A(n,j). See the man page for the lapack routine dgbtrf. The LU-factorisation of a banded matrix is also banded:

   type lumesh
      integer          :: ipiv(i_N)
      double precision :: M(3*i_KL+1, i_N)
   end type lumesh

The element M(2*KL+1+n-j, j) = A(n,j). Pivoting information is stored along with the matrix.

Modifications for the parallel implementation

The code has been parallelised using the Message Passing Interface (MPI) in a way designed to be as unintrusive as possible. Sections of the code with special MPI function calls are separated from the serial sections by preprocessor directives. The number of processors is given by the parameter _Np=_Nr*_Ns set in parallel.h. The rank of a process is given by mpi_rnk declared in mpi.f90, and mpi_sze$=$_Np.

For modest parallelisations, it is recommended to vary _Nr and to keep _Ns=1 (radial split only). This ensures a couple of conditions are met that could easily be forgotten when setting up a run: _Ns must divide both i_Z and i_M.

Data is split over the spherical harmonics for the linear parts of the code — evaluating curls, gradients and matrix inversions for the timestepping. These linear operations do not couple modes. Here all radial points for a particular mode are located on the same processor, separate modes may be located on separate processes. In the definition of type (coll) the parameter i_H1 is replaced by i_pH1, the maximum number of modes expected for each process/core. The variable var_H has elements to determine which harmonics are on which processor. The elements pH0 and pH1 are the zeroth index and the number of harmonics for the local process, minus one. The corresponding values for all processors are stored in the arrays pH0_(rank) and pH1_(rank). For a variable of type (coll), valid values for the harmonic index in Re(n,nh) are nh=0..pH1 but refer to the indices nh=pH0..pH0+pH1 of the serial case, see #Ordering_of_the_Fourier_modes.

Data is split radially when calculating Fourier transforms and when evaluating products in real space. In the definitions of type (spec) and type (phys) the parameter i_N is replaced by i_pN, the maximum number of radial points for each process. The location of the radial subdomain on the local processor is given by additional elements of the variable mes_D. The elements pNi and pN are the location of the first (inner) radial point and the number of points. The values for all processors are stored in the arrays pNi_(rank) and pN_(rank). For a variable of type (spec), valid values for the radial index in Re(nh,n) are n=1..pN which refer to the indices of $r_n$, where $n$=pNi..pNi+pN-1.

The bulk of communication occurs during the data transposes in the functions var_coll2spec() and var_spec2coll(). For _Np=_Nr (_Ns=1) processors the data is transferred in _Np-1 steps. At each step each processor sends and receives a block of data, the source and destination are selected systematically as set in the following loop:

   do step = 1, mpi_sze-1
      dest  = modulo(mpi_rnk+step, mpi_sze)
      src   = modulo(mpi_rnk-step+mpi_sze, mpi_sze) 
      ...

If _Ns/=1, then in physical space the data is split axially and the number of points present is i_pZ=i_Z/_Ns. Then the total data that must be communicated is doubled if _Ns/=1, but only _Nr+_Ns-2 messages are required vs _Np-1$\approx$_Nr*_Ns. The number of messages then scales with the root of _Np substantially reducing latency.