Releases: lattice/quda

QUDA v1.1.0

28 Oct 22:49
d2320dd

Version 1.1.0 - October 2021

  • Add support for NVSHMEM communication for the Dslash operators, for significantly improved strong scaling. See https://github.com/lattice/quda/wiki/Multi-GPU-with-NVSHMEM for more details.

  • Addition of the MSPCG preconditioned CG solver for Möbius fermions. See https://github.com/lattice/quda/wiki/The-Multi-Splitting-Preconditioned-Conjugate-Gradient-(MSPCG),-an-application-of-the-additive-Schwarz-Method for more details.

  • Addition of the Exact One Flavor Algorithm (EOFA) for Möbius fermions. See https://github.com/lattice/quda/wiki/The-Exact-One-Flavor-Algorithm-(EOFA) for more details.

  • Addition of a fully GPU native Implicitly Restarted Arnoldi eigensolver (as opposed to partially relying on ARPACK). See https://github.com/lattice/quda/wiki/QUDA%27s-eigensolvers#implicitly-restarted-arnoldi-eigensolver for more details.

  • Significantly reduced latency for reduction kernels through the use of heterogeneous atomics. Requires CUDA 11.0+.

  • Addition of support for a split-grid multi-RHS solver. See https://github.com/lattice/quda/wiki/Split-Grid for more details.

  • Continued work on enhancing and refining the staggered multigrid algorithm. The MILC interface can now drive the staggered multigrid solver.

  • Multigrid setup can now use tensor cores on Volta, Turing and Ampere GPUs to accelerate the calculation. Enable with the
    QudaMultigridParam::use_mma parameter.
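
A minimal sketch of enabling this through the C interface (a sketch only: the field and enum names follow the wording above and should be checked against quda.h for your QUDA version):

```cpp
#include <quda.h>

// Configure a multigrid solve whose setup phase uses tensor cores (MMA).
QudaMultigridParam mg_param = newQudaMultigridParam();
// ... populate levels, blocking, smoother settings, etc. as usual ...
mg_param.use_mma = QUDA_BOOLEAN_TRUE;  // enable tensor-core acceleration of the setup
```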

  • Improved support of managed memory through the addition of a prefetch API. This can dramatically improve the performance of the multigrid setup when oversubscribing the memory.

  • Improved performance when running MILC RHMC with QUDA.

  • Add support for a new internal data order FLOAT8. This is the default data order for nSpin=4 half and quarter precision fields,
    though the prior FLOAT4 order can be enabled with the cmake option QUDA_FLOAT8=OFF.

  • Removal of the singularity from the reconstruct-8 and reconstruct-9 compressed gauge field orderings. This enables support for free fields with these orderings.

  • The clover parameter convention has been codified: one can either
    1.) pass in QudaInvertParam::kappa and QudaInvertParam::csw separately, and QUDA will infer the necessary clover coefficient, or
    2.) pass an explicit value of QudaInvertParam::clover_coeff (e.g. CHROMA's use case) and that will override the above inference.
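
A sketch of the two conventions (field names follow the wording above; verify the exact names against quda.h before use):

```cpp
#include <quda.h>

QudaInvertParam inv_param = newQudaInvertParam();

// Convention 1: pass kappa and csw; QUDA infers the clover coefficient.
inv_param.kappa = 0.1234;
inv_param.csw   = 1.0;

// Convention 2 (e.g. CHROMA's use case): pass the clover coefficient
// explicitly, which overrides the inference above. Use one convention or
// the other, not both.
inv_param.clover_coeff = 0.1234;
```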

  • QUDA now includes fast-compilation options (QUDA_FAST_COMPILE_DSLASH and QUDA_FAST_COMPILE_REDUCE) which enable much faster build times for development at the expense of reduced performance.

  • Add support for compiling QUDA using clang for both the host and device compiler.

  • While the bulk of the work associated with making QUDA portable to different architectures will form the core of QUDA 2.0, some of the initial refactoring associated with this has already been applied.

  • Significant cleanup of the tests directory to reduce boilerplate.

  • General improvements to the cmake build system using modern cmake features. We now require cmake 3.15.

  • Extended the ctest list to include some optional benchmarks.

  • Fix a long-standing issue on multi-node systems with Kepler GPUs and dual-socket Intel CPUs.

  • Improved ASAN integration: SANITIZE builds now work out of the box with no need to set the ASAN_OPTIONS environment variable.

  • Add support for the extended QIO branch (now required for MILC).

  • Bump QMP version to 2.5.3.

  • Updated to Eigen 3.3.9.

  • Multiple bug fixes and clean up to the library. Many of these are listed here: https://github.com/lattice/quda/milestone/24?closed=1

QUDA v1.0.0

10 Jan 19:31
66729fd

Version 1.0.0 - 10 January 2020

  • Add support for CUDA 10.2: QUDA 1.0.0 is supported on CUDA 7.5-10.2
    using either GCC or clang compilers. CUDA 10.x and either GCC >=
    6.x or clang >= 6.x are highly recommended.

  • Significant improvements to the CMake build system and removal of the
    legacy configure build.

  • Added more targeted compilation options to constrain which
    precisions and reconstruct types are compiled. QUDA_PRECISION is a
    cmake parameter that is a 4-bit number corresponding to which
    precisions are enabled, with 1 = quarter, 2 = half, 4 = single and
    8 = double; the default is 14, which enables double, single and
    half precision. QUDA_RECONSTRUCT is a 3-bit number corresponding to
    which reconstruct types are enabled, with 1 = reconstruct-8/9, 2 =
    reconstruct-12/13 and 4 = reconstruct-18; the default is 7, which
    enables all reconstruct types.
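
The bit arithmetic behind these defaults can be illustrated as follows (the values mirror the text above rather than any QUDA header):

```cpp
// Decomposition of the default QUDA_PRECISION and QUDA_RECONSTRUCT bitmasks.
constexpr int prec_quarter = 1, prec_half = 2, prec_single = 4, prec_double = 8;
constexpr int default_precision = prec_double | prec_single | prec_half;  // 8 + 4 + 2 = 14

constexpr int rec_8_9 = 1, rec_12_13 = 2, rec_18 = 4;
constexpr int default_reconstruct = rec_8_9 | rec_12_13 | rec_18;         // 1 + 2 + 4 = 7

static_assert(default_precision == 14 && default_reconstruct == 7,
              "matches the documented defaults");
```

For example, configuring with QUDA_PRECISION=12 (8 + 4) would compile only double and single precision.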

  • All dslash kernels have been completely rewritten using the accessor
    framework. This dramatically reduces code complexity and improves
    performance.

  • New physics functionality added: gauge Laplace kernel, Gaussian
    quark smearing, topological charge density.

  • QUDA can now be built either to utilize texture-memory reads or to
    use direct memory accesses (cmake option QUDA_TEX). Textures are
    enabled by default, though we note that since Pascal it can be
    advantageous to disable textures and utilize direct reads.

  • QUDA is no longer supported on the Fermi generation of GPUs (sm_20
    and sm_21). Compilation and running should still be possible but
    will require compilation with texture objects disabled.

  • Added support for quarter precision (QUDA_QUARTER_PRECISION) for
    the linear operator and associated solvers.

  • Implemented both CA-CG and CA-GCR communication-avoiding solvers, for
    use either as stand-alone solvers or as a means to accelerate
    multigrid.

  • Continued evolution and optimization of the multigrid framework.
    Regardless, we advise users to use the latest develop branch when
    using multigrid, since it continues to be a fast-moving target with
    continual focus on optimization and improvement.

  • An implementation of the Thick Restarted Lanczos Method (TRLM) for
    eigenvector solving of the normal operator.

  • Lanczos-accelerated multigrid through the use of coarse-grid
    deflation and / or using singular vectors to define the prolongator.

  • Removal of the legacy contraction and covariant derivative
    algorithms, and replacement with accessor-based rewrites.

  • Improved heavy-quark residual convergence, which ensures correct
    convergence for MILC heavy-quark observables.

  • Experimental support for Just-In-Time (JIT) compilation using Jitify.

  • Significantly improved unit testing framework using ctest.

  • QUDA can now be built to target Google's address sanitizer
    (CMAKE_BUILD_TYPE option is SANITIZE) for improved debugging.

  • QUDA can now download and install the USQCD libraries QMP and QIO
    automatically as part of the compilation process. To enable this,
    the option QUDA_DOWNLOAD_USQCD=ON should be set. As with the Eigen
    installation, this requires access to the internet.

  • QUDA can now download and install the ARPACK library automatically
    if the QUDA_DOWNLOAD_ARPACK option is enabled.

  • Updated to CUB 1.8.

  • Multiple bug fixes and clean up to the library. Many of these are
    listed here: https://github.com/lattice/quda/milestone/21?closed=1

QUDA v0.9.0

24 Jul 13:52
49dec72

Version 0.9.0 - 24 July 2018

  • Add support for CUDA 9.x: QUDA 0.9.0 is supported on CUDA 7.0-9.2.

  • Continued focus on optimization of multi-GPU execution, with
    particular emphasis on Dslash scaling. For more details on
    optimizing multi-GPU performance, see
    https://github.com/lattice/quda/wiki/Multi-GPU-Support

  • On systems that support it, QUDA now uses direct peer-to-peer
    communication between GPUs within the same node. The Dslash policy
    autotuner will ascertain the optimal communication route to take,
    whether it be to route through CPU memory, use the DMA copy engines,
    or directly write the halo buffer to neighboring GPUs.

  • On systems that support it, QUDA will take advantage of GPU Direct
    RDMA. This is enabled through setting the environment variable
    QUDA_ENABLE_GDR=1 which will augment the dslash tuning policies to
    include policies using GPU-aware MPI to facilitate direct GPU-NIC
    communication. This can improve strong scaling by up to 3x.

  • Improved precision when using half precision (use rounding instead
    of truncation when converting to/from float).
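
QUDA's half precision stores data as 16-bit fixed-point values with an accompanying norm, so the float conversion matters; the difference is illustrated by the following toy conversion (illustrative only, not QUDA's actual code):

```cpp
#include <cstdint>
#include <cmath>

// Toy conversion of a value in [-1, 1] to a 16-bit fixed-point representation.
// Truncation always biases toward zero; rounding to nearest roughly halves the error.
int16_t convert_truncate(float x) { return static_cast<int16_t>(x * 32767.0f); }
int16_t convert_round(float x)    { return static_cast<int16_t>(std::lrintf(x * 32767.0f)); }
```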

  • Add support for symmetric preconditioning for 4-d preconditioned
    Shamir and Mobius Dirac operators.

  • Added initial support for a multi-right-hand-side staggered Dirac
    operator (treating the right-hand-side index as a fifth dimension).

  • Added initial implementation of block CG linear solver.

  • Added BiCGStab(l) linear solver. The parameter "l" corresponds to
    the size of the space to perform GCR-style residual minimization.
    This is typically much better behaved than BiCGStab for the Wilson
    and Wilson-clover linear systems.

  • Initial version of adaptive multigrid fully implemented into QUDA.

  • Creation of a multi-blas and multi-reduction framework. This is
    essential for high performance in pipelined, block and
    communication-avoiding solvers that work on "matrices of vectors" as
    opposed to "scalars of vectors". The maximum tile size used by the
    multi-blas framework is set by the QUDA_MAX_MULTI_BLAS_N cmake
    parameter, which defaults to 4 for reduced compile time. For
    production use of such solvers, this should be increased to 8-16.

  • Optimization of multi-shift solver using multi-blas framework to permit
    kernel fusion of all shift updates.

  • Complete rewrite and optimization of the clover inversion, HISQ force
    kernels, and HISQ link-fattening algorithms using accessors.

  • QUDA can now directly load/store from MILC's site structure array.
    This removes the need to unpack and pack data prior to calling QUDA,
    and dramatically reduces CPU overhead.

  • Removal of legacy data structures and kernels. In particular, the
    original single-GPU-only ASQTAD fermion force has been removed.

  • Implementation of STOUT fattening kernel.

  • Significant improvement to the cmake build system to improve
    compilation speed and aid productivity. In particular, QUDA now
    supports being built as a shared library which greatly reduces link
    time.

  • Autoconf and configure build system is no longer supported.

  • Automated unit testing of dslash_test and blas_test is now enabled
    using ctest.

  • Adds support for MPS, enabled through setting the environment
    variable QUDA_ENABLE_MPS=1. This allows GPUs to be oversubscribed by
    multiple processes, which can improve overall job throughput.

  • Implemented a self-profiler that builds on top of the autotuning
    framework. The kernel profile is output to profile_n.tsv, where n
    starts at 0 and is incremented with each call to saveProfile (which
    dumps the profile to disk). An equivalent algorithm policy profile is
    output to profile_async_n.tsv, which contains policies such as a
    complete dslash. The filename prefix and path can be overridden using
    the QUDA_PROFILE_OUTPUT_BASE environment variable.

  • Implemented simple tracing facility that dumps the flow of kernels
    called through a single execution to trace.tsv. Enabled with
    environment variable QUDA_ENABLE_TRACE=1.

  • Multiple bug fixes and clean up to the library. Many of these are
    listed here: https://github.com/lattice/quda/milestone/15?closed=1

Pre-release 0.9 with old MILC interface

20 Oct 06:54

QUDA v0.9.0 will introduce a new MILC interface. The development version of MILC at https://github.com/milc-qcd/milc_qcd already uses the new interface.

This version is solely for backwards compatibility. It has been tested using a limited test set.

QUDA v0.8.0

20 Oct 06:55