Token-level serialized output training (t-SOT) was recently proposed to address the challenge of streaming multi-talker automatic speech recognition (ASR). T-SOT effectively handles overlapped speech by representing multi-talker transcriptions as a single token stream with $\langle \text{cc}\rangle$ symbols interspersed. However, the use of a naive neural transducer architecture significantly constrained its...
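
To make the single-token-stream representation concrete, the following minimal sketch serializes the words of two overlapping talkers into one stream and recovers them again. It rests on assumptions not stated in the text above: that the special symbol marks a switch between two virtual output channels ("channel change"), that tokens are ordered by their end time, and that the stream starts on channel 0. The names `serialize`, `deserialize`, and `CC`, as well as the example timings, are hypothetical and chosen only for illustration.

```python
CC = "<cc>"  # stand-in for the special symbol; assumed to mean "channel change"


def serialize(tokens):
    """Merge per-channel tokens into one stream.

    tokens: iterable of (word, end_time, channel) with channel in {0, 1}.
    Words are ordered by end time (an assumption), and CC is emitted
    whenever the channel differs from that of the previous word.
    """
    stream = []
    current = 0  # assumption: the stream starts on channel 0
    for word, _, channel in sorted(tokens, key=lambda t: t[1]):
        if channel != current:
            stream.append(CC)
            current = channel
        stream.append(word)
    return stream


def deserialize(stream):
    """Split a serialized stream back into two per-channel word lists."""
    channels = ([], [])
    current = 0
    for tok in stream:
        if tok == CC:
            current = 1 - current  # toggle between channel 0 and channel 1
        else:
            channels[current].append(tok)
    return channels


# Hypothetical example: talker on channel 1 overlaps the talker on channel 0.
tokens = [
    ("hello", 0.5, 0), ("how", 0.9, 0), ("hi", 1.0, 1),
    ("are", 1.2, 0), ("you", 1.5, 0), ("there", 1.6, 1),
]
stream = serialize(tokens)
recovered = deserialize(stream)
```

Because the channel switches are encoded inside the token stream itself, a single-output recognizer could in principle emit overlapped speech without a separate output branch per talker; the per-talker transcripts are recovered afterwards by toggling channels at each special symbol.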