ELSA: Acoustic Event-Wise Semantic Alignment for Fine-Grained Reference-Free Text-to-Audio Evaluation

Author One (1), Author Two (2), Author Three (1,2)
(1) Institution One, (2) Institution Two

Figure 1: Overview of existing reference-free metrics versus our proposed ELSA. ELSA captures fine-grained acoustic events aligned with semantic content.

Abstract

Text-to-Audio (TTA) generation has progressed significantly, but evaluating these systems remains challenging. Existing reference-free metrics often focus on global semantic alignment and neglect fine-grained acoustic events. In this paper, we introduce ELSA (Event-Wise Semantic Alignment), a reference-free metric that evaluates TTA models by aligning acoustic events with their corresponding textual descriptions. Our experiments show that ELSA correlates more strongly with human judgment than state-of-the-art baselines.

Method

The ELSA framework maps specific acoustic events in the generated audio to semantic units in the text prompt. Scoring at the level of individual events, rather than with a single global similarity, allows a more granular evaluation of audio quality and relevance.

Figure 2: The architecture of the ELSA metric.
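
To make the contrast between event-wise and global alignment concrete, the following is a minimal sketch and not the ELSA implementation, whose details are not given on this page. It assumes that each textual event and each audio segment has already been embedded into a shared vector space by some encoder. The function names `event_wise_score` and `global_score`, the max-then-average aggregation, and the toy 3-dimensional vectors are all hypothetical choices made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def event_wise_score(event_embeddings, segment_embeddings):
    """Hypothetical event-wise score: for each text event, take the similarity
    of its best-matching audio segment, then average over events."""
    if not event_embeddings or not segment_embeddings:
        return 0.0
    best = [max(cosine(e, s) for s in segment_embeddings) for e in event_embeddings]
    return sum(best) / len(best)


def global_score(event_embeddings, segment_embeddings):
    """Hypothetical global score: one similarity between the mean text vector
    and the mean audio vector."""
    def mean(vectors):
        return [sum(col) / len(vectors) for col in zip(*vectors)]
    return cosine(mean(event_embeddings), mean(segment_embeddings))


# Toy data (assumed, not from the paper): the prompt names two events.
events = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
# Audio that contains both events.
segments_full = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
# Audio that contains only the first event.
segments_missing = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]

print(event_wise_score(events, segments_full))
print(event_wise_score(events, segments_missing))
print(global_score(events, segments_missing))
```

In this toy setup, dropping one of the two prompted events halves the event-wise score, while the pooled global similarity falls less sharply. This is the kind of sensitivity to individual events that a fine-grained metric is meant to expose. A real system would replace the toy vectors with encoder outputs and may use a different matching rule.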

Demo Pipeline

The ELSA evaluation pipeline runs from text prompt to audio generation to score calculation.

Text Prompt: "A large dog barking aggressively."
ELSA Score: 0.85 (Baseline: 0.60)

Text Prompt: "People clapping and cheering in a stadium."
ELSA Score: 0.92 (Baseline: 0.75)

Experimental Results

Comparison with state-of-the-art metrics and ablation studies.

Analysis

Detailed analysis of event sensitivity compared to previous approaches.

Event Sensitivity

Figure panels: ELSA Event Sensitivity; Previous Approaches.

Citation

@inproceedings{elsa2026,
  title={ELSA: Acoustic Event-Wise Semantic Alignment for Fine-Grained Reference-Free Text-to-Audio Evaluation},
  author={Author One and Author Two and Author Three},
  booktitle={Interspeech 2026},
  year={2026}
}