MIDAPACK - MIcrowave Data Analysis PACKage  1.1b
Parallel software tools for high performance CMB DA analysis
MPI communication patterns

Inter-process communication is needed in the MPI routines of the package whenever the boundaries of the distributed data matrix do not coincide with those of a Toeplitz block. The communication pattern is local in the sense that it involves only neighboring processes, and it is therefore expected to scale well with the number of MPI processes (as it indeed does in the regime in which the tests have been performed). Each process sends to and receives from a neighboring process a vector of data whose length is defined by the half-bandwidth, $\lambda$, of the Toeplitz block shared between them. This provides sufficient information for each process to compute the part of the Toeplitz-vector product corresponding to its input data on its own, without any need for further data exchanges. In particular, we note that all the FFT calls used by the package are either sequential or threaded.
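The left and right ranks that appear in the code excerpts below denote the neighbors of the current process in the one-dimensional process chain. A minimal sketch of how such neighbors are typically derived (a standard MPI idiom for a linear, non-periodic rank ordering; not necessarily the package's exact code) is:

int rank, size;
MPI_Comm_rank(comm, &rank);
MPI_Comm_size(comm, &size);
// MPI_PROC_NULL turns the corresponding sends and receives into no-ops,
// so the processes at the ends of the chain skip the exchange naturally.
int left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
int right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;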

The communication pattern is implemented in two variants. The non-blocking variant is realized with MPI_Isend and MPI_Irecv calls, used twice to send to and receive from the left and the right neighbors, i.e.,

// to the Left: send to the left neighbor, receive from the right one
MPI_Irecv((LambdaIn + lambdaIn_offset), offsetn * m_rowwise, MPI_DOUBLE,
          right, MPI_USER_TAG, comm, &requestLeft_r);
MPI_Isend(LambdaOut, toSendLeft * m_rowwise, MPI_DOUBLE, left,
          MPI_USER_TAG, comm, &requestLeft_s);
// to the Right: send to the right neighbor, receive from the left one
MPI_Irecv(LambdaIn, offset0 * m_rowwise, MPI_DOUBLE, left, MPI_USER_TAG,
          comm, &requestRight_r);
MPI_Isend((LambdaOut + lambdaOut_offset), toSendRight * m_rowwise,
          MPI_DOUBLE, right, MPI_USER_TAG, comm, &requestRight_s);

which is followed by a series of corresponding MPI_Wait calls, i.e.,

MPI_Wait(&requestLeft_r, &status);
MPI_Wait(&requestLeft_s, &status);
MPI_Wait(&requestRight_r, &status);
MPI_Wait(&requestRight_s, &status);
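Between posting the four requests and completing them with the MPI_Wait calls, a process is free to carry out local computation that does not depend on the boundary data, which is the usual motivation for the non-blocking variant. For illustration only, the pattern can be condensed into a minimal self-contained program; the buffer names, the half-bandwidth value, and the single tag are placeholders, and MPI_Waitall stands in for the four individual MPI_Wait calls:

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int lambda = 4;                                  // placeholder half-bandwidth
    double *out = malloc(2 * lambda * sizeof *out);  // boundary data to send
    double *in  = malloc(2 * lambda * sizeof *in);   // neighbors' boundary data
    for (int i = 0; i < 2 * lambda; i++) { out[i] = (double)rank; in[i] = 0.0; }

    int left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
    int right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

    MPI_Request req[4];
    // leftward flow: send the first lambda values to the left neighbor,
    // receive the right neighbor's into the upper half of in
    MPI_Irecv(in + lambda,  lambda, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req[0]);
    MPI_Isend(out,          lambda, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &req[1]);
    // rightward flow: send the last lambda values to the right neighbor,
    // receive the left neighbor's into the lower half of in
    MPI_Irecv(in,           lambda, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &req[2]);
    MPI_Isend(out + lambda, lambda, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req[3]);

    // local work independent of the boundary data could be overlapped here

    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);

    free(out);
    free(in);
    MPI_Finalize();
    return 0;
}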

Alternatively, the blocking variant is implemented with the help of MPI_Sendrecv calls, i.e.,

// to the Left: send to the left neighbor, receive from the right one
MPI_Sendrecv(LambdaOut, toSendLeft * m_rowwise, MPI_DOUBLE, left,
             MPI_USER_TAG, (LambdaIn + lambdaIn_offset),
             offsetn * m_rowwise, MPI_DOUBLE, right, MPI_USER_TAG, comm,
             &status);
// to the Right: send to the right neighbor, receive from the left one
MPI_Sendrecv((LambdaOut + lambdaOut_offset), toSendRight * m_rowwise,
             MPI_DOUBLE, right, MPI_USER_TAG, LambdaIn,
             offset0 * m_rowwise, MPI_DOUBLE, left, MPI_USER_TAG, comm,
             &status);
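Note that MPI_Sendrecv performs the send and the receive as if concurrently, which is what makes the blocking variant safe here: pairing separate blocking MPI_Send and MPI_Recv calls along a chain of processes could otherwise deadlock or serialize the exchanges.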

The choice between the two is made with the help of the global flag FLAG_BLOCKINGCOMM, which by default is set to 0 (non-blocking communication).
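Schematically, the dispatch on the flag could look as follows; this is only a sketch assuming the two code paths above are factored into helpers, and the function names and signatures are illustrative, not part of the MIDAPACK API:

#include <mpi.h>

extern int FLAG_BLOCKINGCOMM;   // global flag; 0 = non-blocking (default)

// hypothetical wrappers around the two snippets shown above
void exchange_blocking(double *out, double *in, int count,
                       int left, int right, MPI_Comm comm);
void exchange_nonblocking(double *out, double *in, int count,
                          int left, int right, MPI_Comm comm);

void exchange_boundaries(double *out, double *in, int count,
                         int left, int right, MPI_Comm comm)
{
    if (FLAG_BLOCKINGCOMM)
        exchange_blocking(out, in, count, left, right, comm);     // MPI_Sendrecv path
    else
        exchange_nonblocking(out, in, count, left, right, comm);  // Isend/Irecv + Wait path
}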