Multi-PE and SIMD architectures are widely used to increase the computing power of embedded DSP processors. The design of such architectures may cost a vast amount of computing resources. Allocation of computing resources therefore becomes a critical design issue for embedded applications. If suitable hardware configurations and software algorithms can be explored at an early design stage, considerable design time can be saved. In this paper, we construct an early-stage simulation framework for multi-PE and SIMD architectures using a multi-threaded, scalable-SIMD library. Relying on the library, co-design is realized at an early design stage using a high-level language. Estimated performance is obtained by native and trace simulation. Object detection is used as a case study. An application-specific rectangle addressing mode and hybrid vectorization for long SIMD words are designed. Parallelism exploration is then performed to search for an appropriate arrangement of computing resources. As a result, a four-PE, 256-bit SIMD DSP processor is designed. The real-time constraint can be met with 1/4 of the computing resources.