This PhD project focuses on transforming robotic manipulation through the integration of multisensory data. It aims to develop multimodal Vision-Language-Action (VLA) models that enable robots to learn from real-world experience with minimal reliance on large annotated datasets. The goal is to enhance robot capabilities with energy-efficient machine learning models inspired by human learning processes, supporting applications in dynamic environments.