Submitted by Muhammad Khalifa
Gaming the Judge: Unfaithful Chain-of-Thought Can Undermine Agent Evaluation
LAUNCH Lab