In the H.264 video compression standard, the deblocking filter accounts for about one-third of all computation in the decoder. As multiprocessor architectures become the prevailing trend in system design, computation time can be reduced if the deblocking filter distributes its operations well across multiple processing elements. In this paper, we use the 16-pixel-long boundary, the basic unit for deblocking in the H.264 standard, as the basis for analyzing and exploiting parallelism in deblocking filtering. Compared with existing approaches that use a macroblock as the basic unit of analysis, the 16-pixel-long boundary offers finer granularity and therefore more opportunities to increase the degree of parallelism. We also propose a possible compromise for fully utilizing limited hardware resources, together with the hardware architectural requirements for deblocking. Compared with the 2D wave-front order for deblocking 1920×1080 and 1080×1920 pixel frames, the proposed design achieves speedups of 1.57 and 2.15 times, respectively, given an unlimited number of processing elements. With this approach, the execution time of the deblocking filter grows in proportion to the square root of the frame size (for a fixed width/height ratio), extending the range of frame sizes for which real-time deblocking is practical.
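
The square-root scaling can be illustrated with a minimal sketch of the macroblock-level 2D wave-front baseline; it does not model the boundary-level design proposed in this paper. The sketch assumes that each 16×16 macroblock depends only on its left and upper neighbours, that every macroblock takes one time step, and that the number of processing elements is unlimited. The function names `wavefront_steps` and `simulate_wavefront` are hypothetical and introduced only for illustration.

```python
MB = 16  # macroblock edge length in pixels


def mb_grid(width_px, height_px):
    """Return (columns, rows) of macroblocks, rounding partial blocks up."""
    cols = (width_px + MB - 1) // MB
    rows = (height_px + MB - 1) // MB
    return cols, rows


def wavefront_steps(width_px, height_px):
    """Closed-form step count: the longest dependency chain runs from the
    top-left to the bottom-right macroblock, which is cols + rows - 1."""
    cols, rows = mb_grid(width_px, height_px)
    return cols + rows - 1


def simulate_wavefront(cols, rows):
    """Assumed dependency model: a macroblock can start one step after
    both its left and upper neighbours have finished."""
    step = [[0] * cols for _ in range(rows)]
    for y in range(rows):
        for x in range(cols):
            left = step[y][x - 1] if x > 0 else 0
            top = step[y - 1][x] if y > 0 else 0
            step[y][x] = max(left, top) + 1
    return step[rows - 1][cols - 1]


full_hd = wavefront_steps(1920, 1080)   # 120 + 68 - 1 = 187
ultra_hd = wavefront_steps(3840, 2160)  # 240 + 135 - 1 = 374

# The simulation agrees with the closed form for the Full HD grid.
assert simulate_wavefront(*mb_grid(1920, 1080)) == full_hd

# Four times the pixels at the same aspect ratio, twice the steps.
print(full_hd, ultra_hd, ultra_hd / full_hd)  # 187 374 2.0
```

Under these assumptions the step count is governed by the sum of the frame's width and height in macroblocks, so quadrupling the pixel count at a fixed aspect ratio only doubles the number of steps.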