Multi-Source Spatial Knowledge Understanding for Immersive Visual Text-to-Speech

Abstract

Visual Text-to-Speech (VTTS) aims to take a spatial environmental image as the prompt and synthesize reverberant speech for the spoken content. Previous research has focused on the RGB modality for global environmental modeling, overlooking the potential of multi-source spatial knowledge such as depth, speaker position, and environmental semantics. To address these issues, we propose a novel multi-source spatial knowledge understanding scheme for immersive VTTS, termed MS²KU-VTTS. Specifically, we first prioritize the RGB image as the dominant source and treat the depth image, speaker position knowledge from object detection, and semantic captions from an image-understanding LLM as supplementary sources. We then propose a serial interaction mechanism to deeply engage both the dominant and supplementary sources. The resulting multi-source knowledge is dynamically integrated according to the contribution of each source. This enriched interaction and integration of multi-source spatial knowledge guides the speech generation model, enhancing the immersive spatial speech experience. Experimental results demonstrate that MS²KU-VTTS surpasses existing baselines in generating immersive speech. Demos and code are available at: https://github.com/MS2KU-VTTS/MS2KU-VTTS.
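As one concrete illustration of how a supplementary source can be obtained, the sketch below derives a rough speaker-position cue from the RGB panorama with an off-the-shelf torchvision detector. The choice of detector and the use of the most confident person-box centre are illustrative assumptions, not necessarily the extraction pipeline used in MS²KU-VTTS.

```python
import torch
from torchvision.io import read_image
from torchvision.models.detection import (
    fasterrcnn_resnet50_fpn, FasterRCNN_ResNet50_FPN_Weights,
)

# Off-the-shelf COCO detector; the detector actually used by MS2KU-VTTS may differ.
weights = FasterRCNN_ResNet50_FPN_Weights.DEFAULT
detector = fasterrcnn_resnet50_fpn(weights=weights).eval()

def speaker_position(image_path: str):
    """Return the normalized (x, y) centre of the most confident 'person' box,
    used here as a rough proxy for the speaker's position in the panorama."""
    img = read_image(image_path)                # uint8 tensor [C, H, W]
    x = weights.transforms()(img)               # detector's preprocessing
    with torch.no_grad():
        pred = detector([x])[0]                 # dict with boxes, labels, scores
    person = pred["labels"] == 1                # COCO class 1 == person
    if not person.any():
        return None                             # no speaker visible
    best = pred["scores"][person].argmax()
    x1, y1, x2, y2 = pred["boxes"][person][best]
    _, h, w = img.shape
    return float((x1 + x2) / (2 * w)), float((y1 + y2) / (2 * h))
```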

MODEL ARCHITECTURE

Figure: Overview of MS²KU-VTTS, which includes the Multi-source Spatial Knowledge, the Dominant-Supplement Serial Interaction, the Dynamic Fusion, and the Speech Generation modules.
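As a rough illustration of the Dominant-Supplement Serial Interaction and Dynamic Fusion blocks in the figure, the PyTorch sketch below lets the dominant RGB stream attend to each supplementary source in series and then fuses all sources with a learned contribution gate. The cross-attention layout, dimensions, and gating design are assumptions for illustration, not the released implementation.

```python
import torch
import torch.nn as nn

class DominantSupplementInteraction(nn.Module):
    """Sketch: RGB (dominant) features attend to each supplementary source in
    series; all sources are then fused with a contribution-aware softmax gate.
    Each source is assumed to be a token sequence projected to a shared dim."""

    def __init__(self, dim: int = 256, heads: int = 4, num_sources: int = 4):
        super().__init__()
        self.cross_attn = nn.ModuleList(
            nn.MultiheadAttention(dim, heads, batch_first=True)
            for _ in range(num_sources - 1)
        )
        self.gate = nn.Linear(dim, 1)  # scores each source's contribution

    def forward(self, rgb, depth, position, caption):
        # Serial interaction: the dominant RGB stream is refined by each
        # supplementary source in turn (depth -> position -> caption).
        fused = rgb
        refined = [rgb]
        for attn, src in zip(self.cross_attn, (depth, position, caption)):
            out, _ = attn(query=fused, key=src, value=src)
            fused = fused + out          # residual update of the dominant stream
            refined.append(out)

        # Dynamic fusion: weight each source by its learned contribution score.
        pooled = torch.stack([r.mean(dim=1) for r in refined], dim=1)  # [B, S, D]
        weights = torch.softmax(self.gate(pooled), dim=1)              # [B, S, 1]
        return (weights * pooled).sum(dim=1)                           # [B, D]
```

The fused vector would then condition the downstream speech generation module.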

BASELINE MODELS


1) DiffSpeech [1] is a Text-to-Speech (TTS) model built on a denoising diffusion probabilistic approach. It takes text as input and iteratively converts noise into mel-spectrograms conditioned on that text, which stabilizes training and yields high-quality output (a simplified diffusion sampling sketch follows this list).
2) ProDiff [2] is a progressive, fast diffusion model designed for high-quality speech synthesis. It takes text as input and directly predicts clean mel-spectrograms, significantly reducing the number of sampling iterations required.
3) VoiceLDM [3] is a Text-to-Speech (TTS) model that uses text as its primary input. It captures the global environmental context from descriptive prompts to generate audio that matches both the spoken content and the overall situational description. The environmental text descriptions differ between training datasets: the original dataset mainly describes the type of environment, whereas ours emphasizes specific components and their spatial relationships. We therefore focus on the model's novel use of textual descriptions to guide the synthesis of reverberant speech.
4) ViT-TTS [4] is a Visual Text-to-Speech (VTTS) model that takes both text and an environmental image as input. It uses ResNet18 to extract global visual features from the image, which capture the room's acoustic characteristics and enhance audio generation.
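Both DiffSpeech and ProDiff rest on the denoising-diffusion idea sketched below: start from Gaussian noise and repeatedly denoise it into a mel-spectrogram under a text condition. This is a generic, simplified DDPM sampling loop with a placeholder denoiser; it is not either model's released architecture or noise schedule.

```python
import torch
import torch.nn as nn

# Placeholder denoiser: predicts the noise in a mel-spectrogram given the
# diffusion step and a text-encoder output. Not DiffSpeech's or ProDiff's
# actual network -- just enough to run the sampling loop below.
class TinyDenoiser(nn.Module):
    def __init__(self, n_mels: int = 80, text_dim: int = 256, steps: int = 1000):
        super().__init__()
        self.step_emb = nn.Embedding(steps, n_mels)
        self.cond_proj = nn.Linear(text_dim, n_mels)
        self.net = nn.Conv1d(n_mels, n_mels, kernel_size=3, padding=1)

    def forward(self, mel, t, text_cond):
        h = mel + self.step_emb(t)[:, :, None] + self.cond_proj(text_cond)[:, :, None]
        return self.net(h)

@torch.no_grad()
def ddpm_sample(denoiser, text_cond, frames=200, n_mels=80, steps=1000):
    """Standard DDPM reverse process: start from noise and iteratively
    denoise it into a mel-spectrogram conditioned on the text features."""
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    mel = torch.randn(text_cond.size(0), n_mels, frames)
    for t in reversed(range(steps)):
        t_batch = torch.full((mel.size(0),), t, dtype=torch.long)
        eps = denoiser(mel, t_batch, text_cond)
        mean = (mel - (1 - alphas[t]) / (1 - alpha_bars[t]).sqrt() * eps) / alphas[t].sqrt()
        noise = torch.randn_like(mel) if t > 0 else 0.0
        mel = mean + betas[t].sqrt() * noise
    return mel
```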
Systems: Recording | DiffSpeech [1] | ProDiff [2] | VoiceLDM [3] | ViT-TTS [4] | MS²KU-VTTS (ours)
Sample 1:
Target RGB/Depth Image:
Reference Text: finally the one party went off exulting
Caption Text: This panoramic image depicts a spacious outdoor dining area. A long wooden table, surrounded by white chairs, occupies the center. To the left and right of the table, there are sliding glass doors leading to a covered patio area with a thatched roof. The patio is adorned with hanging plants and offers views of lush greenery beyond. The overall ambiance is serene and inviting.
Systems: Recording | DiffSpeech [1] | ProDiff [2] | VoiceLDM [3] | ViT-TTS [4] | MS²KU-VTTS (ours)
Sample 2:
Target RGB/Depth Image:
Reference Text: meanwhile rodolfo had leocadia
Caption Text: This panoramic image showcases a backyard scene. In the foreground, a wooden deck surrounds a blue swimming pool. To the left, a tall tree stands tall, and a small shed is visible in the background. On the right, a house with a patio is seen. The overall atmosphere is peaceful and suburban.
Systems: Recording | DiffSpeech [1] | ProDiff [2] | VoiceLDM [3] | ViT-TTS [4] | MS²KU-VTTS (ours)
Sample 3:
Target RGB/Depth Image:
Reference Text: it is the only amends i ask of you for
Caption Text: This panoramic image depicts a spacious basement apartment. A man stands in the center of the room, facing a kitchen area to the right. To the left, there is a hallway with a door leading to other rooms. The walls are painted white, and the floors are covered in hardwood. The overall atmosphere is clean and modern.

REFERENCES


[1] J. Liu, C. Li, Y. Ren, F. Chen, and Z. Zhao, “DiffSinger: Singing voice synthesis via shallow diffusion mechanism,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 10, 2022, pp. 11020–11028.
[2] R. Huang, Z. Zhao, H. Liu, J. Liu, C. Cui, and Y. Ren, “ProDiff: Progressive fast diffusion model for high-quality text-to-speech,” in Proceedings of the 30th ACM International Conference on Multimedia, 2022, pp. 2595–2605.
[3] Y. Lee, I. Yeon, J. Nam, and J. S. Chung, “VoiceLDM: Text-to-speech with environmental context,” in ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 12566–12571.
[4] H. Liu, R. Huang, X. Lin, W. Xu, M. Zheng, H. Chen, J. He, and Z. Zhao, “ViT-TTS: Visual text-to-speech with scalable diffusion transformer,” in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023, pp. 15957–15969.