TY - JOUR
T1 - Parallelizing complex streaming applications on distributed scratchpad memory multicore architecture
AU - Chen, Shin Kai
AU - Hung, Cheng Yu
AU - Chen, Ching Chih
AU - Liu, Chih-Wei
PY - 2014/1/1
Y1 - 2014/1/1
N2 - Multicore processors can provide sufficient computing power and flexibility for complex streaming applications, such as high-definition video processing. To reduce hardware complexity and power consumption, a distributed scratchpad memory architecture is considered instead of a cache memory architecture. However, the distributed design poses new programming challenges: the combined complexity of inter-processor communication, synchronization, and workload balancing makes it difficult to exploit all available capabilities and achieve maximal throughput. In this study, we developed an efficient design flow for parallelizing multimedia applications on a distributed scratchpad memory multicore architecture. An application is first partitioned into streaming components and then mapped onto the multicore processors. Various hardware-dependent factors and application-specific characteristics are taken into account to generate efficient task partitions and allocate resources appropriately. To test and verify the proposed design flow, three popular multimedia applications were implemented: a full-HD motion JPEG decoder, an object detector, and a full-HD H.264/AVC decoder. For demonstration purposes, the Sony PlayStation® 3 (PS3) was selected as the target platform. Simulation results show that, on the PS3, the full-HD motion JPEG decoder built with the proposed design flow can decode about 108.9 frames per second (fps) in the 1080p format. The object detector can perform real-time detection at 2.84 fps at 1280 × 960 resolution, 11.75 fps at 640 × 480 resolution, and 62.52 fps at 320 × 240 resolution. The full-HD H.264/AVC decoder application can achieve nearly 50 fps.
AB - Multicore processors can provide sufficient computing power and flexibility for complex streaming applications, such as high-definition video processing. To reduce hardware complexity and power consumption, a distributed scratchpad memory architecture is considered instead of a cache memory architecture. However, the distributed design poses new programming challenges: the combined complexity of inter-processor communication, synchronization, and workload balancing makes it difficult to exploit all available capabilities and achieve maximal throughput. In this study, we developed an efficient design flow for parallelizing multimedia applications on a distributed scratchpad memory multicore architecture. An application is first partitioned into streaming components and then mapped onto the multicore processors. Various hardware-dependent factors and application-specific characteristics are taken into account to generate efficient task partitions and allocate resources appropriately. To test and verify the proposed design flow, three popular multimedia applications were implemented: a full-HD motion JPEG decoder, an object detector, and a full-HD H.264/AVC decoder. For demonstration purposes, the Sony PlayStation® 3 (PS3) was selected as the target platform. Simulation results show that, on the PS3, the full-HD motion JPEG decoder built with the proposed design flow can decode about 108.9 frames per second (fps) in the 1080p format. The object detector can perform real-time detection at 2.84 fps at 1280 × 960 resolution, 11.75 fps at 640 × 480 resolution, and 62.52 fps at 320 × 240 resolution. The full-HD H.264/AVC decoder application can achieve nearly 50 fps.
KW - Distributed scratchpad memory architecture
KW - Multicore architecture
KW - Parallel programming
KW - Streaming application
UR - http://www.scopus.com/inward/record.url?scp=84906945872&partnerID=8YFLogxK
U2 - 10.1007/s10766-013-0256-7
DO - 10.1007/s10766-013-0256-7
M3 - Article
AN - SCOPUS:84906945872
VL - 42
SP - 875
EP - 899
JO - International Journal of Parallel Programming
JF - International Journal of Parallel Programming
SN - 0885-7458
IS - 6
ER -