Tom proposes encoding video as code so LLMs can manipulate a three-dimensional scene and render it down efficiently.
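A minimal sketch of what "video as code" could look like, assuming a scene is a list of objects with simple motion parameters and that rendering means projecting each object onto a 2D image plane per frame. The names `Sphere`, `position_at`, `project`, and `render` are hypothetical illustrations, not part of Tom's proposal, and the pinhole projection stands in for whatever renderer would actually be used.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Sphere:
    """Hypothetical scene object: a sphere moving at constant speed along x."""
    name: str
    x: float
    y: float
    z: float          # distance in front of the camera, must be > 0
    radius: float
    vx: float = 0.0   # velocity along x, in scene units per second


def position_at(obj: Sphere, t: float) -> Sphere:
    """Return the object as it appears at time t (seconds)."""
    return replace(obj, x=obj.x + obj.vx * t)


def project(obj: Sphere, focal: float = 1.0) -> tuple[float, float, float]:
    """Pinhole perspective projection: 3D sphere -> 2D circle (u, v, r)."""
    scale = focal / obj.z
    return (obj.x * scale, obj.y * scale, obj.radius * scale)


def render(scene: list[Sphere], fps: int, seconds: float) -> list[list[tuple[float, float, float]]]:
    """Render the scene down to frames; each frame is a list of 2D circles."""
    frames = []
    for i in range(int(fps * seconds)):
        t = i / fps
        frames.append([project(position_at(obj, t)) for obj in scene])
    return frames


# The "video" is this short, editable description rather than pixel data.
scene = [Sphere("ball", x=0.0, y=0.0, z=2.0, radius=0.5, vx=1.0)]
frames = render(scene, fps=2, seconds=1)

# An edit a model might emit: push the ball further from the camera.
edited = [replace(obj, z=4.0) if obj.name == "ball" else obj for obj in scene]
edited_frames = render(edited, fps=2, seconds=1)
```

The point of the sketch is that a change to the video is a one-line change to the scene description, after which every frame is re-derived; a real system would rasterize pixels instead of returning circle parameters.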