With the introduction of the Kinect as a gaming interface, its broad commercial accessibility and high-quality depth sensor have attracted the attention not only of consumers but also of researchers in the robotics community. The Kinect's active sensing technique produces robust depth maps for reliable human pose estimation. For a broader range of applications in robotic perception, however, this active sensing approach fails under many operating conditions, for example on objects with specular or transparent surfaces. A recent initial study has shown that some of the resulting problems can be alleviated by complementing the active sensing scheme with passive cross-modal stereo between the Kinect's RGB and IR cameras. However, that method suffers from interference by the IR projector, which the active depth sensing requires. We investigate these issues, conduct a more detailed study of the physical characteristics of the sensors, and propose a more general method that learns optimal filters for cross-modal stereo under projected patterns. Our approach improves on the baseline in a point-cloud-based object segmentation task, without modifying the Kinect hardware and despite the interference from the projector.