@haggaie @ferasd
@e-ago FYI
On Volta-architecture GPUs, `CU_STREAM_WAIT_VALUE_NOR` is available.
Therefore we can enable `IBV_EXP_PEER_OP_POLL_NOR_DWORD_CAP`.
When testing its use in libgdsync, we noticed gpudirect/libgdsync#68.
We tracked it down to `mlx5_alloc_cq_buf` not properly setting the CQE owner bit to 1. Our current fix is:
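```c
for (i = 0; i < nent; ++i) {
	cqe = buf->buf + i * cqe_sz;
	/* With 128-byte CQEs, the 64-byte CQE sits in the second half of the slot. */
	cqe += cqe_sz == 128 ? 1 : 0;
	cqe->op_own = MLX5_CQE_INVALID << 4;
	/* Pre-set the owner bit so a peer NOR poll does not fire on an unwritten CQE. */
	if (cq->peer_ctx && (cq->peer_ctx->caps & IBV_EXP_PEER_OP_POLL_NOR_DWORD_CAP)) {
		cqe->op_own |= MLX5_CQE_OWNER_MASK;
	}
}
```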
We are not sure whether other places need the same fix, e.g. `mlx5_cq_resize_copy_cqes`.