WiFi-based human pose estimation (HPE) has recently attracted significant attention in the research community because it is device-free, cost-effective, and privacy-preserving. Deploying such a solution requires improving model performance while maintaining efficiency, particularly on resource-constrained devices. To address these challenges, this paper introduces a novel approach, called WiLHPE, which integrates multi-modal sensors such as cameras and WiFi to accurately detect human pose landmarks. WiLHPE processes the raw WiFi signal through a novel neural network architecture that dynamically learns convolutional kernels weighted with attention across the channel and frequency kernel spaces. This design diversifies the kernels to enhance the recognition capabilities of WiFi signals without introducing additional complexity, thereby preserving efficiency. Experiments conducted on the MM-Fi dataset show that WiLHPE outperforms state-of-the-art approaches while incurring minimal computational overhead, which makes the proposed approach well suited to large-scale scenarios.
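
To make the idea of attention-weighted dynamic kernels concrete, the following is a minimal sketch of the general technique in its simplest form: an input-dependent softmax mixture of candidate kernels applied to a one-dimensional signal. It is not the WiLHPE architecture. The function names, the scalar gate, and the single-channel setting are assumptions made purely for illustration, and the sketch attends only over the kernel index rather than over the channel and frequency kernel spaces described above.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def kernel_attention(signal, gate_weights, gate_biases):
    """Return one attention weight per candidate kernel.

    Hypothetical gate: the signal is mean-pooled to a scalar, and each
    candidate kernel gets a score w * pooled + b before the softmax.
    """
    pooled = sum(signal) / len(signal)
    scores = [w * pooled + b for w, b in zip(gate_weights, gate_biases)]
    return softmax(scores)


def dynamic_conv1d(signal, kernels, gate_weights, gate_biases):
    """Mix the candidate kernels with input-dependent attention, then
    apply the single mixed kernel as a valid 1D cross-correlation."""
    attn = kernel_attention(signal, gate_weights, gate_biases)
    width = len(kernels[0])
    # Aggregate the candidates into one kernel, so the convolution
    # itself runs only once regardless of the number of candidates.
    mixed = [sum(a * k[i] for a, k in zip(attn, kernels)) for i in range(width)]
    out = [
        sum(mixed[j] * signal[t + j] for j in range(width))
        for t in range(len(signal) - width + 1)
    ]
    return out, attn


if __name__ == "__main__":
    # Illustrative values only; they do not come from the MM-Fi dataset.
    signal = [1.0, 2.0, 3.0, 4.0]
    kernels = [[1.0, 0.0], [0.0, 1.0]]
    out, attn = dynamic_conv1d(signal, kernels, [1.0, -1.0], [0.0, 0.0])
    print(out, attn)
```

Because the candidates are merged into a single kernel before the convolution, the per-sample cost of the convolution in this sketch does not grow with the number of candidates; only the small gate and the kernel aggregation do. This is the usual efficiency argument for dynamic convolution in general.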