At I/O 2024, Google showcased the multimodal capabilities of its Gemini 1.5 AI model, which can take photos, video, audio, and text as input within a single prompt and generate responses grounded in all of them. Google DeepMind is now applying this capability to train robots to navigate their environments.
Google DeepMind recently published a research paper and released several video clips demonstrating how robots can be trained to understand multimodal instructions, combining natural language and images, to perform useful navigation tasks. The research focuses on a category of navigation tasks called Multimodal Instruction Navigation with Demonstration Tours (MINT), in which the environment is introduced to the robot through a previously recorded video tour. Recent advances in Vision Language Models (VLMs) have shown strong potential for solving MINT.
Training Robots with Gemini 1.5 AI Model
In a thread shared on X (formerly Twitter), Google highlighted the challenge of limited context length in many AI models, which hampers their ability to recall an environment. The Gemini 1.5 Pro model, with its 1 million token context window, overcomes this limitation, allowing a robot to keep an entire tour of its surroundings in context when navigating.
Drawing on human instructions, video tours, and common-sense reasoning, the robots successfully navigated real-world spaces. Trainers first guided the robots through specific areas, pointing out key locations to remember; the robots were then asked to lead the trainers back to those locations, demonstrating that they could understand and follow multimodal instructions.
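To make the idea concrete, here is a minimal sketch of how a long-context multimodal model could be queried with a demonstration tour plus a multimodal instruction, using Google's public google-generativeai Python SDK. This is not DeepMind's actual pipeline, which couples the model with the robot's onboard navigation system; the file names, prompt wording, and expected output format are assumptions made for illustration only.

```python
import time
import google.generativeai as genai

# Configure the SDK. In practice the key would come from an environment
# variable; "YOUR_API_KEY" is a placeholder.
genai.configure(api_key="YOUR_API_KEY")

# Upload a previously recorded demonstration tour of the environment and the
# image half of the multimodal instruction. Both file names are hypothetical.
tour_video = genai.upload_file("office_tour.mp4")
instruction_image = genai.upload_file("whiteboard_sketch.jpg")

# Video uploads are processed asynchronously; wait until the file is ready.
while tour_video.state.name == "PROCESSING":
    time.sleep(5)
    tour_video = genai.get_file(tour_video.name)

model = genai.GenerativeModel("gemini-1.5-pro")

# Ask the long-context model to ground the multimodal instruction in the tour
# and return a goal that a robot's navigation stack could drive toward.
response = model.generate_content([
    tour_video,
    instruction_image,
    "The video is a walkthrough tour of a building. Treat the photo as the "
    "instruction 'take me to the place where this was drawn'. Give the "
    "timestamp in the tour where that location appears and describe the "
    "location in one sentence.",
])
print(response.text)
```

In the research setting, an answer like this would be converted into a navigation goal for the robot rather than simply printed.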