Face detection is one of the fundamental technologies for the future smart objects. However, its computation intensive property thwarts the practice of a real-time application on an embedded device. Parallel processing and many-core architecture have become a mainstream to achieve high performance in the future computing systems. The parallelism of an application needs to be exposed before being exploited by the parallel architecture. This paper performs a comprehensive analysis of the parallelism of a face detection algorithm at different algorithmic levels. This paper has demonstrated that each parallelism level has its own potential to enhance performance, but also imposes different limiting factors to the overall performance. Based on the analysis results and design experience, this paper proposes a multi-staged mixed-level parallelization scheme to retain the performance scalability and avoid the limiting factors. With this scheme, we are able to achieve up to 37.5x performance enhancement on a 64-core system.