SoundNet: Learning Sound Representations from Unlabeled Video

vision to train the student model in sound. [9] also transfer visual supervision into depth models.

Cross-Modal Learning and Unlabeled Video: Our approach is broadly inspired by efforts to model cross-modal relations [24, 14, 7, 26] and works that leverage large amounts of unlabeled video [25, 41, 8, 40, 39].