Deep learning technologies have developed rapidly in recent years and now play an important role in our lives. Among them, convolutional neural networks (CNNs) perform well in many applications. Result quality generally improves as the number of convolutional layers increases, but so does the computational complexity; hence, highly resource-efficient accelerators are in demand. In this paper, we propose a new CNN accelerator that features a delay-chain-free input data aligner as well as a dual-convolver processing element (DCPE). Our architecture does not require delay chains with a large number of registers for input data alignment, which not only reduces area and power but also improves overall resource utilization. In addition, a set of DCPEs shares the same input aligner to produce multiple output feature maps concurrently, which provides the desired computing power and reduces external memory traffic. An accelerator instance with 8 DCPEs (144 MACs) has been implemented in a TSMC 40nm process. The internal logic consumes only 285K gates, and the total internal memory size is merely 44KB. When running VGG-16, the average performance is 190GOPS (@750MHz), the resource (MAC) utilization reaches 88.3%, and the energy efficiency is 481GOPS/W.
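The input-sharing idea behind the DCPE array can be illustrated with a minimal behavioral sketch (this is an assumption-laden software model, not the paper's RTL): one aligned input tile is fetched once and broadcast to several processing elements, each holding its own filter, so several output feature maps are produced per input fetch.

```python
import numpy as np

def conv3x3(tile, kernel):
    """Valid 3x3 convolution over a 2-D input tile (reference behavior)."""
    h, w = tile.shape
    out = np.zeros((h - 2, w - 2))
    for y in range(h - 2):
        for x in range(w - 2):
            # Each output pixel is one MAC window over the aligned tile.
            out[y, x] = np.sum(tile[y:y + 3, x:x + 3] * kernel)
    return out

def shared_input_pe_array(tile, kernels):
    """Hypothetical model of the DCPE array: the input tile is aligned and
    fetched once, then reused by every PE (one kernel per PE), so N output
    feature maps are computed per external-memory access."""
    return [conv3x3(tile, k) for k in kernels]
```

In this model, doubling the number of PEs doubles the output maps produced per input fetch, which is why sharing the aligner cuts external memory traffic roughly in proportion to the PE count.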