Human cognitive functions involve both bottom-up and top-down processes. Several methods for singing melody extraction have been proposed that emphasize either bottom-up or top-down processing. In hearing, the bottom-up processes include spectral and spectro-temporal decomposition of the sound by the cochlea and the auditory cortex. In this paper, we propose a neural network for singing melody extraction that combines a spectro-temporal multi-resolution decomposition of the log-spectrogram of the sound with a semantic segmentation model, addressing the bottom-up and top-down processing of hearing, respectively. Simulation results show that the proposed model outperforms previously proposed methods, whether they emphasize bottom-up or top-down processing, on almost all objective evaluation metrics.
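To make the bottom-up front-end concrete, the following is a minimal sketch, not the authors' implementation, of a spectro-temporal multi-resolution decomposition: log-magnitude spectrograms computed at several window lengths and stacked as channels for a downstream segmentation-style network. All window lengths, the hop size, and the common frequency-bin count are illustrative assumptions.

```python
# Hypothetical sketch (not the paper's code): a multi-resolution
# log-spectrogram front-end using plain NumPy. Window lengths, hop
# size, and n_bins below are illustrative choices, not the paper's.
import numpy as np

def log_spectrogram(x, win_len, hop):
    """Log-magnitude STFT with a Hann window."""
    window = np.hanning(win_len)
    n_frames = 1 + (len(x) - win_len) // hop
    frames = np.stack(
        [x[i * hop:i * hop + win_len] * window for i in range(n_frames)]
    )
    spec = np.abs(np.fft.rfft(frames, axis=1))  # (n_frames, win_len//2 + 1)
    return np.log1p(spec)

def multi_resolution_features(x, win_lens=(256, 512, 1024), hop=128, n_bins=129):
    """Stack log-spectrograms at several resolutions into one tensor.

    Each resolution is cropped/padded to a common (time, frequency)
    grid so the results can be stacked as input channels for a
    segmentation-style network.
    """
    specs = [log_spectrogram(x, w, hop) for w in win_lens]
    t = min(s.shape[0] for s in specs)  # shortest time axis wins
    channels = []
    for s in specs:
        s = s[:t]
        if s.shape[1] >= n_bins:        # crop extra frequency bins
            s = s[:, :n_bins]
        else:                           # zero-pad frequency axis
            s = np.pad(s, ((0, 0), (0, n_bins - s.shape[1])))
        channels.append(s)
    return np.stack(channels)  # (n_resolutions, time, n_bins)

# Example: 1 second of a 440 Hz tone at 16 kHz
sr = 16000
x = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
feats = multi_resolution_features(x)
print(feats.shape)  # -> (3, 118, 129)
```

The short windows preserve temporal detail while the long windows sharpen frequency resolution; stacking them lets the network exploit both, analogous to the multi-resolution analysis attributed to the auditory pathway in the abstract.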