Vision-language models trained on Internet-scale data can be incorporated directly into end-to-end robotic control to boost generalization and enable emergent semantic reasoning. The paper's goal is to train a single end-to-end model that both learns to map robot observations to actions and retains the benefits of large-scale vision-language pretraining.
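One common way to let a single model share a language-pretrained backbone and an action output (a design in this spirit, not necessarily the paper's exact method) is to discretize each continuous action dimension into bins and emit actions as tokens, just as the model emits text tokens. The sketch below is a toy stand-in: `ToyVLAPolicy`, its random `backbone`, and the bin sizes are all hypothetical; a real system would replace the backbone with a pretrained vision-language transformer.

```python
import numpy as np

NUM_BINS = 256   # discretization resolution per action dimension (assumed)
ACTION_DIM = 7   # e.g. 6-DoF end-effector delta + gripper (assumed)

def discretize(action, low=-1.0, high=1.0, bins=NUM_BINS):
    """Map a continuous action vector to integer tokens in [0, bins-1]."""
    clipped = np.clip(action, low, high)
    return np.round((clipped - low) / (high - low) * (bins - 1)).astype(int)

def undiscretize(tokens, low=-1.0, high=1.0, bins=NUM_BINS):
    """Invert discretize(): map integer tokens back to continuous actions."""
    return low + tokens / (bins - 1) * (high - low)

class ToyVLAPolicy:
    """Hypothetical stand-in for a vision-language-action policy.

    A real model would replace `backbone` with a pretrained
    vision-language transformer; here it is a fixed toy feature map,
    so the example stays self-contained and runnable.
    """

    def __init__(self, seed=0):
        rng = np.random.default_rng(seed)
        self.w = rng.normal(size=(64, ACTION_DIM))  # toy action head

    def backbone(self, image, instruction):
        # Toy features: image statistics and instruction length stand in
        # for real visual and textual embeddings.
        feat = np.zeros(64)
        feat[:32] = image.mean()
        feat[32:] = len(instruction) / 100.0
        return feat

    def act(self, image, instruction):
        feat = self.backbone(image, instruction)
        continuous = np.tanh(feat @ self.w)   # continuous action in [-1, 1]
        tokens = discretize(continuous)       # emitted like language tokens
        return undiscretize(tokens)           # decoded for execution

policy = ToyVLAPolicy()
image = np.zeros((224, 224, 3))
action = policy.act(image, "pick up the apple")  # 7-dim action vector
```

The round trip through `discretize`/`undiscretize` loses at most half a bin width, which is why a modest number of bins per dimension suffices while keeping the action vocabulary small enough to share with the text vocabulary.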