We propose a three-layer image compression system consisting of a base-layer VVC (intra) codec, a learning-based residual-layer codec, and a learnable hyperprior. This proposal (Team: NCTU-Commlab) was submitted to the Challenge on Learned Image Compression (CLIC) in March 2020. Our contribution is a data fusion attention module, together with the integration of several known components into an efficient image codec that achieves higher compression performance than the standard VVC coding scheme. Unlike conventional residual image coding, both our encoder and decoder also take the base-layer output as input. In addition, we construct a refinement neural network that merges the decoded residual image from the residual layer with the decoded base-layer image to form the final reconstructed image. We tested two autoencoder structures for the encoder and decoder, namely a CNN with GDN and the generalized octave CNN. Our results show that the transmitted latent representations code the residuals very efficiently, because the object boundary information can be provided by the proposed spatial attention module. The experiments indicate that the proposed system outperforms single-layer VVC in both PSNR and subjective quality at around 0.15 bits per pixel (bpp).
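The data flow described above can be illustrated with a minimal NumPy sketch. It is not the proposed system: every function name here is hypothetical, the base layer is approximated by coarse uniform quantization rather than VVC, the learned encoder and decoder are replaced by a fine uniform quantizer, and the refinement network is replaced by a clipped sum. The hyperprior and the attention module are omitted. The only point shown is that the base-layer output is available to the residual encoder, the residual decoder, and the merging step.

```python
import numpy as np

def base_codec(x, step=32.0):
    # Hypothetical stand-in for VVC intra encode + decode (assumption):
    # coarse uniform quantization of the pixel values.
    return np.round(x / step) * step

def residual_encode(residual, base_rec, step=4.0):
    # Stand-in for the learned residual encoder. The base-layer output is
    # stacked with the residual to mimic the two-input design; this toy
    # encoder then quantizes only the residual channel.
    fused = np.stack([residual, base_rec])
    return np.round(fused[0] / step).astype(np.int32)

def residual_decode(latent, base_rec, step=4.0):
    # Stand-in for the learned residual decoder. It also receives base_rec,
    # although this toy version ignores it after a shape check.
    assert latent.shape == base_rec.shape
    return latent.astype(np.float64) * step

def refine(residual_hat, base_rec):
    # Stand-in for the refinement network: a clipped sum of the two layers.
    return np.clip(base_rec + residual_hat, 0.0, 255.0)

def mse(a, b):
    return float(np.mean((a - b) ** 2))

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 255.0, size=(8, 8))      # toy grayscale image

base_rec = base_codec(x)                      # base layer
latent = residual_encode(x - base_rec, base_rec)
residual_hat = residual_decode(latent, base_rec)
final = refine(residual_hat, base_rec)        # merged reconstruction

print("base-layer MSE:", mse(x, base_rec))
print("two-layer MSE: ", mse(x, final))
```

In this toy setting the two-layer reconstruction error is bounded by half the residual quantization step, so it is always lower than the error of the coarse base layer alone. The sketch says nothing about bit rate, which in the actual system depends on the entropy model.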