Cannot See the Forest for the Trees: Aggregating Multiple Viewpoints to Better Classify Objects in Videos