Audio: MFCC: Add Voice Activity Detection based on Mel spectrum #10782

Open: singalsu wants to merge 3 commits into thesofproject:main from singalsu:mfcc_add_vad

@singalsu
Collaborator

@singalsu singalsu commented May 15, 2026

No description provided.

@singalsu
Collaborator Author

This is still WIP. I'd like to add a better audio feature header to the fake PCM stream. Successive PRs should start to use the compress PCM type for MFCC output data. The MFCC config blob could then enable discontinuous data in VAD mode, e.g. background-noise Mel spectrum values once per second and, while speech is detected, values at the FFT hop rate, e.g. every 10 ms.

Comment thread src/audio/mfcc/mfcc_vad.c
/* Find j such that a_weight_hz[j] <= f_hz < a_weight_hz[j+1] */
for (j = 0; j < A_WEIGHT_TABLE_SIZE - 2; j++) {
	if (f_hz < a_weight_hz[j + 1])
		break;
Collaborator

binary search?

Collaborator Author

@singalsu singalsu May 18, 2026

Can this be implemented with some binary search function? It's a very small table (36 values), and this is initialization-time code, not a hot path.
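
For reference, the interval lookup discussed in this thread could be written as a binary search along the following lines. This is only a sketch: `find_interval()` and the short `table_hz` array are hypothetical stand-ins for the lookup over `a_weight_hz`, and the table values are illustrative rather than the 36 entries used in the PR. The result is clamped to the same range as the linear loop above.

```c
#include <stdint.h>

/* Hypothetical sorted frequency table standing in for a_weight_hz[];
 * the values are illustrative, not the PR's 36-entry table.
 */
#define TABLE_SIZE 8
static const int16_t table_hz[TABLE_SIZE] = {
	10, 20, 40, 80, 160, 315, 630, 1250
};

/* Return j such that table[j] <= f_hz < table[j + 1], clamped to
 * [0, size - 2] so that table[j + 1] is always a valid index.
 */
static int find_interval(const int16_t *table, int size, int16_t f_hz)
{
	int lo = 0;
	int hi = size - 2;

	while (lo < hi) {
		int mid = (lo + hi + 1) / 2;	/* round up so the loop terminates */

		if (table[mid] <= f_hz)
			lo = mid;	/* f_hz is at or above table[mid] */
		else
			hi = mid - 1;	/* f_hz is below table[mid] */
	}

	return lo;
}
```

With a table this small the gain over the linear scan is a handful of comparisons at initialization time, which supports the point that the simpler loop is probably adequate here.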

Comment thread src/audio/mfcc/mfcc_vad.c
Comment thread src/audio/mfcc/mfcc_vad.c Outdated
Comment thread src/audio/mfcc/mfcc_vad.c Outdated
Comment thread src/audio/mfcc/mfcc_vad.c Outdated
@singalsu singalsu marked this pull request as ready for review May 19, 2026 11:11
Copilot AI review requested due to automatic review settings May 19, 2026 11:11
Contributor

Copilot AI left a comment

Pull request overview

This PR introduces an optional MFCC Voice Activity Detection (VAD) feature that runs on the MFCC component’s Mel log spectrum and embeds a VAD flag into the MFCC/Mel output stream, along with updated host-side tuning/decoding tooling and documentation.

Changes:

  • Add a new mfcc_vad module (state, initialization, per-frame update) and wire it into MFCC Mel-log-spectrum processing.
  • Insert a per-frame VAD flag into the MFCC output stream immediately after the magic header word (gated by a new Kconfig option).
  • Update tuning tools/documentation: add a live DSP-VAD-triggered Whisper transcription script, migrate README to Markdown, and extend decode_mel.m to extract VAD.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| src/include/sof/audio/mfcc/mfcc_vad.h | New public header for VAD state + API and tuning constants |
| src/audio/mfcc/mfcc_vad.c | New VAD implementation (noise floor tracking + weighted energy delta + hangover) |
| src/include/sof/audio/mfcc/mfcc_comp.h | Extend MFCC component state to carry VAD state and output bookkeeping |
| src/audio/mfcc/mfcc_common.c | Run VAD during Mel processing and emit VAD flag in stream output |
| src/audio/mfcc/mfcc_setup.c | Initialize/free VAD resources during MFCC setup/teardown |
| src/audio/mfcc/Kconfig | Add CONFIG_COMP_MFCC_VAD option controlling build + format change |
| src/audio/mfcc/CMakeLists.txt | Conditionally compile mfcc_vad.c |
| src/arch/host/configs/library_defconfig | Enable VAD in host library defconfig |
| src/audio/mfcc/tune/sof_mel_to_text_live_dsp_vad.py | New live capture + Whisper transcription tool using DSP-embedded VAD |
| src/audio/mfcc/tune/README.md | New Markdown documentation (replaces README.txt) |
| src/audio/mfcc/tune/decode_mel.m | Extend Mel decoder to parse VAD flag and plot it |
Comments suppressed due to low confidence (1)

src/audio/mfcc/mfcc_common.c:297

  • vad_pending is only set for state->mel_only. If VAD is meant to be emitted for all MFCC output frames (including cepstral output), this needs to be set for the non-mel_only path too; otherwise, please update docs to state the VAD flag is only present in Mel-log-spectrum output streams.
		if (state->mel_only) {
			state->out_data_ptr = state->mel_spectra->data;
#ifdef CONFIG_COMP_MFCC_VAD
			state->vad_pending = true;
#endif

Comment thread src/include/sof/audio/mfcc/mfcc_vad.h Outdated
Comment thread src/audio/mfcc/Kconfig Outdated
Comment thread src/audio/mfcc/tune/sof_mel_to_text_live_dsp_vad.py
Comment thread src/audio/mfcc/mfcc_common.c Outdated
Comment thread src/audio/mfcc/mfcc_setup.c Outdated
Comment thread src/audio/mfcc/tune/decode_mel.m
Comment thread src/audio/mfcc/tune/decode_mel.m Outdated
Comment thread src/audio/mfcc/tune/sof_mel_to_text_live_dsp_vad.py Outdated
@singalsu
Collaborator Author

I think I'll remove the CONFIG_COMP_MFCC_VAD and build it always. Then it's simpler to make it a permanent part of the magic header. The configuration blob for Mel mode can enable computing it, while in MFCC mode it will be zeros unless enabled there also with the blob. Then the parsing scripts can always use the same data format.

@singalsu singalsu marked this pull request as draft May 19, 2026 15:02
@singalsu
Collaborator Author

Adding more features --> draft

@lyakh
Collaborator

lyakh commented May 20, 2026

I think I'll remove the CONFIG_COMP_MFCC_VAD and build it always. Then it's simpler to make it a permanent part of the magic header. The configuration blob for Mel mode can enable computing it, while in MFCC mode it will be zeros unless enabled there also with the blob. Then the parsing scripts can always use the same data format.

@singalsu another option would be to keep the CONFIG_... but make it "y" by default

@singalsu
Collaborator Author

I think I'll remove the CONFIG_COMP_MFCC_VAD and build it always. Then it's simpler to make it a permanent part of the magic header. The configuration blob for Mel mode can enable computing it, while in MFCC mode it will be zeros unless enabled there also with the blob. Then the parsing scripts can always use the same data format.

@singalsu another option would be to keep the CONFIG_... but make it "y" by default

True, but I find testing all the Kconfig and blob config permutations very time-consuming, so it is better to reduce variation when possible. VAD is so small that I don't think it matters. And the blob can switch it off, as it now does for MFCC ceps mode.

@singalsu singalsu marked this pull request as ready for review May 20, 2026 10:03
@singalsu singalsu requested review from a team, jsarha and ranj063 as code owners May 20, 2026 10:03
@singalsu singalsu requested a review from Copilot May 20, 2026 10:20
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 20 out of 20 changed files in this pull request and generated 9 comments.

Comment thread src/include/sof/audio/mfcc/mfcc_comp.h
Comment thread src/include/sof/audio/mfcc/mfcc_comp.h
Comment thread src/audio/mfcc/mfcc_common.c
Comment thread src/audio/mfcc/mfcc_ipc4.c Outdated
Comment thread src/audio/mfcc/CMakeLists.txt
Comment thread src/audio/mfcc/tune/decode_all.m
Comment thread src/audio/mfcc/tune/decode_mel.m Outdated
Comment thread src/include/sof/audio/mfcc/mfcc_vad.h
Comment thread src/audio/mfcc/mfcc_setup.c
singalsu added 3 commits May 20, 2026 15:16
Add mfcc_vad module with A-weighted energy-based voice activity
detection that operates on the Mel log spectrum produced by the MFCC
component. The algorithm tracks a per-bin noise floor with instant-down
and slow-rise behavior, then computes a weighted energy delta above
the floor. Speech is declared when the delta exceeds a threshold
(0.35 in Q9.23) with a 20-frame hangover to prevent rapid toggling.
The VAD is gated on the new enable_vad flag in sof_mfcc_config.

Add struct mfcc_data_header with six int32 fields (magic,
frame_number, reserved, energy, noise_energy, vad_flag) prepended to
every output frame in all format paths (S16, S24, S32). This replaces
the previous magic-word-only header. The header carries the VAD
decision and energy values from the DSP for downstream consumers.

Extend sof_mfcc_config in user/mfcc.h with reserved16[3] padding for
32-bit alignment, and new boolean fields enable_vad, enable_dtx,
update_controls, and reserved_bool[5]. The config blob size increases
from 104 to 116 bytes.

Update Matlab/Octave decode scripts (decode_mel.m, decode_ceps.m,
decode_all.m) and setup_mfcc.m for the expanded header and config
struct. Regenerate topology2 configuration blobs (default.conf,
mel80.conf) with the new blob size.

Signed-off-by: Seppo Ingalsuo <seppo.ingalsuo@linux.intel.com>
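
The detection logic described in the first commit message (instant-down, slow-rise noise floor per bin, weighted delta above the floor, threshold of 0.35 in Q9.23, 20-frame hangover) can be illustrated roughly as below. This is a minimal sketch under stated assumptions, not the code in mfcc_vad.c: all names are hypothetical, the bin count and rise step are invented for illustration, the weights are assumed to be Q1.15 values summing to 32768, and the hangover is assumed to hold the speech decision for 20 frames after the last frame above threshold.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_NUM_BINS		4	/* hypothetical; the real Mel bin count comes from the config */
#define SKETCH_THRESHOLD	2936013	/* 0.35 in Q9.23, round(0.35 * 2^23) */
#define SKETCH_HANGOVER		20	/* frames the speech decision is held */
#define SKETCH_RISE_STEP	1024	/* hypothetical slow-rise step per frame, Q9.23 */

struct vad_sketch {
	int32_t floor[SKETCH_NUM_BINS];	/* per-bin noise floor, Q9.23 log values */
	int hangover;			/* remaining hangover frames */
	bool initialized;		/* false until the first frame seeds the floor */
};

/* Update the noise floor with one Mel log frame and return the VAD decision.
 * weights[] are assumed Q1.15 bin weights that sum to 32768.
 */
static bool vad_sketch_update(struct vad_sketch *v, const int32_t *mel_log,
			      const int16_t *weights)
{
	int64_t delta = 0;
	int i;

	for (i = 0; i < SKETCH_NUM_BINS; i++) {
		if (!v->initialized || mel_log[i] < v->floor[i]) {
			v->floor[i] = mel_log[i];	/* instant down */
		} else {
			v->floor[i] += SKETCH_RISE_STEP;	/* slow rise */
			if (v->floor[i] > mel_log[i])
				v->floor[i] = mel_log[i];	/* never above the signal */
		}
		delta += (int64_t)weights[i] * (mel_log[i] - v->floor[i]);
	}
	v->initialized = true;
	delta >>= 15;	/* remove the Q1.15 weight scaling, result is Q9.23 */

	if (delta > SKETCH_THRESHOLD) {
		v->hangover = SKETCH_HANGOVER;	/* speech, re-arm the hangover */
		return true;
	}
	if (v->hangover > 0) {
		v->hangover--;	/* below threshold, but still inside the hangover */
		return true;
	}
	return false;
}
```

The hangover counter is what prevents rapid toggling: a single frame above threshold keeps the decision at speech for the following 20 frames even if the energy drops immediately.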
Add sof_mel_to_text_live_dsp_vad.py that captures mel spectrogram frames
from ALSA with embedded DSP VAD flag and performs live speech-to-text
transcription using OpenVINO Whisper. The script buffers mel frames during
speech and triggers Whisper inference when silence is detected after
speech. Capture runs continuously in a separate thread during inference
to avoid frame drops.

Replace the old README.txt with a comprehensive README.md that documents
the MFCC tuning tools, testbench usage with run_mfcc.sh, output file
formats, Matlab/Octave decode and plotting scripts, and the new live
transcription workflow.

Signed-off-by: Seppo Ingalsuo <seppo.ingalsuo@linux.intel.com>
Add IPC4 notification that sends the VAD state to user space via a
switch control whenever the VAD decision changes between speech and
silence. The notification is initialized during prepare and sent from
the audio processing path on VAD state transitions.

The implementation follows the TDFB/sound_dose notification pattern:
mfcc_ipc4.c contains the IPC4-specific notification init and send
functions, while mfcc.c provides weak stubs so IPC3 builds link
without the IPC4 dependencies.

Add handling for SOF_IPC4_SWITCH_CONTROL_PARAM_ID in mfcc_get_config
and mfcc_set_config so the kernel driver can read back the current VAD
state after receiving a notification. The switch control is read-only
from the DSP side.

Both the notification init and the VAD state change detection are
gated on the update_controls flag in the configuration blob struct.

Add a switch control (mixer) to the MFCC topology2 widget definition
for the VAD notification.

Signed-off-by: Seppo Ingalsuo <seppo.ingalsuo@linux.intel.com>
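
The combination of a weak stub and transition-gated notification described in the third commit message might look roughly like the sketch below. Every name here is hypothetical and none of it is taken from mfcc.c or mfcc_ipc4.c; `__attribute__((weak))` is the GCC/Clang spelling, and the firmware may use its own macro for it. The idea is that a strong definition in an IPC4-only file would override the weak default at link time.

```c
#include <stdbool.h>

/* Hypothetical names throughout; this is not the PR's API. */

/* Weak default: links when no IPC4 notification code is built in. */
__attribute__((weak)) int sketch_vad_notify(bool speech)
{
	(void)speech;
	return 0;	/* nothing to send, report success */
}

struct sketch_vad_report {
	bool update_controls;	/* mirrors the blob flag that gates notifications */
	bool last_sent;		/* last VAD state reported to user space */
};

/* Returns true when a notification was issued for this frame. */
static bool sketch_vad_report_update(struct sketch_vad_report *r, bool speech)
{
	if (!r->update_controls || speech == r->last_sent)
		return false;	/* gated off, or no state transition */

	r->last_sent = speech;
	return sketch_vad_notify(speech) == 0;
}
```

Sending only on transitions keeps the notification rate low compared with the per-frame VAD flag carried in the data header.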
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 20 out of 20 changed files in this pull request and generated 8 comments.

print(f"Whisper model: {model_path} (encoder: {encoder_device}, decoder: {decoder_device})")
print()

proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.DEVNULL)
if not data:
    rc = proc.poll()
    if rc is not None:
        stderr_out = proc.stderr.read().decode(errors='replace')
function [mel, t, n] = decode_mel(fn, num_mel, fmt, num_channels)
function [mel, t, n, vad, energy, noise_energy, frame_number] = ...
    decode_mel(fn, num_mel, fmt, num_channels)

Comment on lines +74 to +78
frame_number = mod(payload(1,:), 65536) + payload(2,:) * 65536;
% payload(3:4,:) is reserved, skip
energy = mod(payload(5,:), 65536) + payload(6,:) * 65536;
noise_energy = mod(payload(7,:), 65536) + payload(8,:) * 65536;
vad = mod(payload(9,:), 65536) + payload(10,:) * 65536;
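
The Octave lines above rebuild each 32-bit header field from two 16-bit samples: the low word is taken as unsigned via `mod(..., 65536)` and the high word keeps its sign. A C equivalent could look like the sketch below; `words_to_int32()` is a hypothetical helper, and the low-word-first ordering is an assumption read off the `payload` indices in the snippet.

```c
#include <stdint.h>

/* Combine two 16-bit stream samples into one signed 32-bit value.
 * The low word is reinterpreted as unsigned, like mod(low, 65536) in the
 * Octave code, and the high word carries the sign.
 */
static int32_t words_to_int32(int16_t low, int16_t high)
{
	return (int32_t)high * 65536 + (int32_t)(uint16_t)low;
}
```

For example, the two words 0x6363 and 0x6d66 combine into 0x6d666363, the magic value quoted in the header struct comment.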
Comment on lines +180 to +181
if (config->enable_vad)
	mfcc_vad_update(&cd->vad, state->mel_log_32);
Comment on lines +38 to +39
/** \brief Switch control index for VAD notification to user space */
#define MFCC_CTRL_INDEX_VAD 0
Comment on lines +41 to +54
/**
 * \brief Data header prepended to every MFCC output frame.
 *
 * Written before the Mel spectrum or cepstral coefficient data in each
 * output frame. All six fields are 32 bits wide, so the header is 24 bytes.
 */
struct mfcc_data_header {
	uint32_t magic;		/**< Magic word MFCC_MAGIC (0x6d666363) */
	uint32_t frame_number;	/**< Frame number, counting calculated frames starting from 0 */
	int32_t reserved;	/**< Reserved for future use, set to 0 */
	int32_t energy;		/**< Weighted signal energy in Q9.23 */
	int32_t noise_energy;	/**< Weighted noise floor energy in Q9.23 */
	int32_t vad_flag;	/**< VAD decision: 1 = speech, 0 = silence */
};
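
A downstream consumer could read this header from a captured byte stream roughly as follows. This is a sketch, not code from the PR: the struct and magic value are copied from the snippet above, while `mfcc_sketch_read_header()` and `q9_23_to_double()` are hypothetical helpers, and the stream is assumed to use the same byte order as the host.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MFCC_MAGIC 0x6d666363

/* Copied from the header under review so the example is self-contained */
struct mfcc_data_header {
	uint32_t magic;
	uint32_t frame_number;
	int32_t reserved;
	int32_t energy;
	int32_t noise_energy;
	int32_t vad_flag;
};

_Static_assert(sizeof(struct mfcc_data_header) == 24, "header must be 24 bytes");

/* Copy the header out of a byte buffer and check the magic word.
 * Returns 0 on success, -1 if the buffer is too short or the magic differs.
 */
static int mfcc_sketch_read_header(const uint8_t *buf, size_t len,
				   struct mfcc_data_header *hdr)
{
	if (len < sizeof(*hdr))
		return -1;

	memcpy(hdr, buf, sizeof(*hdr));	/* avoids unaligned access */
	return hdr->magic == MFCC_MAGIC ? 0 : -1;
}

/* Convert a Q9.23 fixed-point value to floating point */
static double q9_23_to_double(int32_t q)
{
	return (double)q / (double)(1 << 23);
}
```

After a successful read, `hdr->vad_flag` gives the per-frame decision, and `q9_23_to_double(hdr->energy)` gives the weighted energy as a floating-point number.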
Comment thread src/audio/mfcc/mfcc_vad.c
Comment on lines +73 to +75
int16_t f_hz, f0, f1, w, w0, w1, den;
int16_t mel_end = psy_hz_to_mel((int16_t)(sample_rate / 2));
int16_t mel_step = mel_end / (num_mel + 1);

3 participants