Image-based people counting is an essential technique for public safety management. However, the task remains extremely challenging because of scale variation caused by different congested scenes, viewpoints, image sizes, and density levels. In this paper, we propose a CNN-based framework for people counting and crowd density map estimation that explicitly addresses these scale problems. First, we introduce an encoder-decoder architecture composed of Inception modules to learn multi-scale feature representations. To adapt to image resolution, we also design a multi-loss setting over density maps of different resolutions for network training. Second, we apply multi-task learning to learn joint features for the density map estimation task and the density level classification task, which helps to improve feature generality across scenes. Finally, by adopting the U-Net architecture, we fuse the encoder and decoder features to generate high-resolution density maps. The efficacy of the proposed method is evaluated in extensive experiments that quantify counting performance with multiple evaluation criteria.
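As an illustration of the training objective outlined above, the following is a minimal NumPy sketch of a multi-resolution density loss combined with a density level classification loss. It is not the exact formulation used in this work: the sum-pooling downsampling, the mean-squared-error criterion, the `cls_weight` value, and all function names (`sum_pool`, `multi_resolution_loss`, `cross_entropy`, `joint_loss`) are assumptions made for the example.

```python
import numpy as np


def sum_pool(density, factor):
    """Downsample a 2-D density map by summing factor x factor blocks.

    Summing (rather than averaging) keeps the total count unchanged.
    This downsampling choice is an assumption of the sketch.
    """
    h, w = density.shape
    if h % factor or w % factor:
        raise ValueError("map size must be divisible by factor")
    return density.reshape(h // factor, factor, w // factor, factor).sum(axis=(1, 3))


def multi_resolution_loss(preds, target, factors):
    """Sum of MSE terms, one per predicted map, each compared with the
    ground truth downsampled to the matching resolution."""
    total = 0.0
    for pred, factor in zip(preds, factors):
        gt = sum_pool(target, factor)
        total += float(np.mean((pred - gt) ** 2))
    return total


def cross_entropy(logits, label):
    """Numerically stable softmax cross-entropy for a single sample."""
    shifted = logits - np.max(logits)
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    return float(-log_probs[label])


def joint_loss(preds, target, factors, level_logits, level_label, cls_weight=0.1):
    """Multi-task objective: density regression plus a weighted
    density level classification term (the weight is hypothetical)."""
    regression = multi_resolution_loss(preds, target, factors)
    classification = cross_entropy(level_logits, level_label)
    return regression + cls_weight * classification
```

In this sketch, `preds` would hold the density maps produced at several decoder resolutions and `factors` the corresponding downsampling factors relative to the full-resolution ground truth. Because sum-pooling preserves the total count, each resolution supervises the same crowd count at a different spatial granularity.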