Field of Invention
The present invention relates to a system for audio signal
processing with reduced feedback and dereverberation and a method for reducing feedback
and dereverberation in microphone signals. The invention, in particular, relates
to combined feedback compensation and multi-channel dereverberation in communication
systems installed in vehicles.
Background of the invention
Verbal communication is often affected by a noisy background.
A prominent example is communication of passengers in vehicles. In particular, at
high traveling speeds dialogs between back and front passengers are easily disturbed
by background noise. If additional audio signals, e.g., from a radio or a CD player,
are output during the conversation, the intelligibility of the utterances deteriorates
further.
To improve the intelligibility of the passengers' utterances,
communication systems comprising speech processing means have recently been employed
in vehicular cabins. Microphones installed near each seat of the cabin can detect
speech signals from the passengers. The microphone signals are processed by a speech
signal processing means and output by loudspeakers. Speech signals from a back passenger
are preferably output by front loudspeakers.
However, it is necessary to reduce acoustic feedback that
otherwise can cause, e.g., unpleasant echoes. In the worst case, acoustic feedback
can even result in a complete breakdown of communication. Reliable feedback reduction
means cannot easily be designed, e.g., since the reverberating characteristics of
vehicular cabins are rather complex.
Communication systems installed in vehicles typically also
comprise audio devices such as, e.g., a radio. If audio signals are output by the loudspeakers,
e.g., due to the operation of a radio, the microphones installed in the vehicular
cabin detect at least some of the audio signals in addition to the speech signals
of the passengers. The detected audio signals have to be sufficiently damped, since
otherwise the closed electro-acoustic loop would result in significant reverberation
of the audio reproduction.
Adaptive filters have been employed for feedback reduction
as well as feedback compensation of audio signals. However, such filters do not
work sufficiently reliably, since the automatic adaptation of the filter coefficients
suffers from severe correlation problems. Therefore, the adaptive filters do not
guarantee the high quality of speech signals that is necessary for an electronically
aided verbal communication in vehicular cabins, for example.
Thus, it is an object of the present invention to provide
an improved system for audio signal processing and an improved method for feedback
reduction in microphone signals, in particular, to facilitate passengers' conversation
in vehicular cabins and to avoid reverberation of audio signals from audio devices
such as, e.g., a radio or a DVD player.
Description of the invention
The above mentioned object is achieved by the inventive
method for processing of audio signals in an audio system comprising at least one
microphone and at least one loudspeaker, wherein the audio signals comprise audio
signals of a first class, in particular, from a first audio device, and audio signals
of a second class, in particular, from a second audio device, comprising the steps
of:
- decorrelating the audio signals of the first class;
- estimating for each loudspeaker an impulse response between the at least one
microphone and the at least one loudspeaker on the basis of the decorrelated audio
signals;
- filtering a microphone signal on the basis of each estimated impulse response
and the decorrelated audio signals to obtain a noise compensated signal; and
- filtering the noise compensated signal on the basis of each estimated impulse
response and the audio signals of the second class to obtain an output signal.
The audio system can, for example, represent a communication
system in a vehicle usable for electrically supporting verbal communication between
speakers, e.g., between passengers in a vehicle, and for concurrently outputting
audio signals from an audio device like a radio. The output signal can be further
processed before it is eventually output by the at least one loudspeaker. The first
audio device can be a radio or a CD player or a DVD player, for example. Audio signals
of the second class may comprise signals that represent verbal utterances, in particular,
by passengers of a vehicle. The second audio device can be a speech output device
outputting processed microphone signals that represent verbal utterances.
The audio signals of the first class are preferably decorrelated
by means of non-linear processing means and/or a time-dependent filtering means.
Thus the signals are processed by some non-linear mapping, e.g., in form of a halfwave
rectification, and/or they are, e.g., filtered by a finite impulse response (FIR)
filter with two different sets of filter coefficients that may be switched during
a short interpolation phase.
For each loudspeaker channel an impulse response between
the microphones and the loudspeaker can be estimated, i.e. the respective automatically
adapted filter coefficients of an adaptive filtering means are calculated, as known
in the art, in order to enable efficient reduction/compensation of the decorrelated
audio signals of the first class of audio signals in the microphone signal that
is processed by the adaptive filtering means.
Adaptation of the filter coefficients can, e.g., be achieved
by the gradient method employing a suitable cost function as known in the art (see,
e.g., Acoustic Echo and Noise Control, by E. Hänsler and G. Schmidt, John Wiley
& Sons, New York, 2004; Adaptive Filter Theory, by S. Haykin, Prentice Hall,
New Jersey, 2002).
The noise compensated signal represents a filtered microphone
signal compensated for noise in the form of the estimated contribution of the audio
signals of the first class output by the loudspeakers to the microphone signal.
Filtering of the microphone signal here means subtracting an undesirable perturbation,
in the form of a particular signal, from the microphone signal. The particular
signal that is subtracted from the microphone signal in order to obtain the noise
compensated or error signal is the sum, over the loudspeaker channels, of the
convolution of the estimated impulse response of each loudspeaker channel (obtained
after adaptation of the respective filter coefficients has been completed) with the
decorrelated audio signal of the respective loudspeaker channel.
The estimated impulse response for each loudspeaker channel
is convolved with the respective audio signal of the second class of audio signals
and the sum of the results from the convolution procedures is subtracted from the
noise compensated signal in order to obtain an output signal. Thus, the output signal
o(n) (with n being a discrete time index), i.e. the filtered noise
compensated signal, is determined from the microphone signal d(n) for
M loudspeakers by
$$o(n) = d(n) - \sum_{i=1}^{M} \hat{d}_{e,i}(n) - \sum_{i=1}^{M} \hat{d}_{\tilde{e},i}(n)$$
where $\hat{d}_{e,i}(n)$ is obtained from the convolution of a vector containing the last
N signals of the i-th loudspeaker channel of a first audio device
with the estimated impulse response vector containing the filter coefficients of
the adaptive filtering means (convolution procedure) and $\hat{d}_{\tilde{e},i}(n)$
is obtained from the convolution of a vector containing the last
N audio signals of the second class, e.g., output signals of a speech output
device, with the same estimated impulse response vector as above.
The inventive method, thus, represents a new combined method
for compensation of audio signals of a first class and for reducing feedback of
audio signals of a second class utilizing the same filter coefficients. Thereby,
a very efficient dereverberation and feedback reduction is achieved, even if audio
signals of the first class and of the second class are concurrently processed by
the audio system and the processed audio signals are concurrently output by the
loudspeakers. Automatic adaptation of the filter coefficients of the filtering means
provided to obtain noise compensated signals from the microphone signals on the
basis of decorrelated audio signals and to obtain output signals from noise compensated
signals on the basis of audio signals from a second class does not suffer significantly
from correlation problems.
According to an advantageous embodiment the short-time
power of the microphone signal, the noise compensated signal and the output signal
are automatically determined and each estimated impulse response, i.e. at least
one estimated impulse response, if only one loudspeaker is present, is automatically
multiplied with a factor less than one, if the short-time power of the noise compensated
signal or of the output signal is higher than the short-time power of the microphone
signal.
By this step, stability of the inventive method can be
guaranteed. In particular, if the short-time power of the noise compensated signal
is higher than the short-time power of the microphone signal, the estimated impulse
response can be weighted by a leakage factor $\rho_e$ with $0 < \rho_e < 1$,
and if the short-time power of the output signal is higher than the
short-time power of the microphone signal the estimated impulse response can also
be weighted by a leakage factor $\rho_{\tilde{e}}$ with
$0 < \rho_{\tilde{e}} < \rho_e < 1$.
If more than one microphone is present, it may be preferred
to beamform the different microphone signals that may be detected by microphone
arrays, in order to obtain a beamformed microphone signal with an enhanced signal-to-noise
ratio as known in the art. By using multiple microphones, the different spatial
characteristics of speech and noise can be exploited and in this way the background
noise can be suppressed.
A conventional delay-and-sum-beamformer may be used. Alternatively,
an adaptive weighted sum beamformer may be employed that combines the pre-processed,
in particular, time aligned signals $a_m$ of M microphones to obtain one output
signal d with an improved signal-to-noise ratio
$$d = \sum_{m=1}^{M} A_m\, a_m$$
with weights $A_m$ that are not time-independent, but have to be recalculated repeatedly as
is required, e.g., to maintain sensitivity in the desired direction and to minimize
sensitivity in the directions of noise sources. Feedback reduction and dereverberation
of a beamformed microphone signal instead of a non-beamformed one can further improve
the quality of the output signals.
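By way of illustration only, the weighted-sum combination of time aligned microphone signals may be sketched in Python as follows. The function names weighted_sum_beamformer and delay_and_sum are hypothetical, the weights are held constant for brevity (an adaptive beamformer would recalculate them repeatedly), and the time alignment itself is assumed to have been carried out beforehand.

```python
def weighted_sum_beamformer(aligned, weights):
    """Combine time aligned microphone signals into d(n) = sum_m A_m * a_m(n).

    aligned: list of M equally long sample lists, one per microphone
             (assumed to be time aligned already).
    weights: list of M weights A_m (held fixed here; hypothetical values).
    """
    if len(aligned) != len(weights):
        raise ValueError("one weight per microphone is required")
    length = len(aligned[0])
    return [
        sum(w * signal[n] for w, signal in zip(weights, aligned))
        for n in range(length)
    ]


def delay_and_sum(aligned):
    """Delay-and-sum special case: every microphone gets the weight 1/M."""
    m = len(aligned)
    return weighted_sum_beamformer(aligned, [1.0 / m] * m)


# Two microphones whose deviations from the common level 2.0 have opposite sign.
mic_signals = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]]
beamformed = delay_and_sum(mic_signals)
```

In this toy input the opposing deviations cancel in the average, which mirrors the way uncorrelated noise is attenuated relative to the coherently added speech component.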
The audio signals of the first class may be transmitted
by an audio device, in particular, a radio or a CD player or a DVD player and/or
the audio signals of the second class comprise at least one speech signal from a
speech output device. This can be the case, e.g. in a communication system installed
in a vehicle that allows for electrically aided verbal communication of the passengers
and also for the playback of DVDs, CDs and radio programs. Feedback of the verbal
inputs by the passengers representing audio signals of the second class can sufficiently
be reduced and reverberation of the output audio signals transmitted by one of the
above-mentioned audio devices can be avoided.
Embodiments of the inventive method are particularly suitable
for a situation, wherein the microphones comprise at least a first set of microphones
comprising at least one microphone and a second set of microphones comprising at
least one microphone and the loudspeakers comprise at least a first set of loudspeakers
located close to the first set of microphones and comprising at least one loudspeaker
and a second set of loudspeakers located close to the second set of microphones
and comprising at least one loudspeaker, in which case the audio signals of the
second class of audio signals can be detected by at least the first set of microphones
to obtain the microphone signal to be processed and output as the output signal
gained by the two filtering steps described above by at least the second set of
loudspeakers.
A method according to this example is of particular use,
if the audio signals of the second class are speech signals from a speech output
device being part of a communication system that, e.g., is installed in a vehicle.
Microphones installed close to a back passenger may detect utterances by this passenger
and the processed speech signals can be output by loudspeakers installed close to
the driver and/or another front passenger. Thereby, the intelligibility of the verbal
communication between front and back passengers can significantly be improved.
The above described processing can be performed in the
time domain or in the frequency domain or in subbands.
The invention also provides a computer program comprising
one or more computer readable media having computer-executable instructions for
performing the above-mentioned steps of embodiments of the inventive method for
audio signal processing.
The above mentioned object is also achieved by a system
for signal processing of audio signals that comprise audio signals of a first class,
in particular, from a first audio device, and audio signals of a second class, in
particular, from a second audio device, comprising
at least one microphone to obtain a microphone signal and at least one loudspeaker;
a pre-processing means configured to receive and to decorrelate the audio signals
of the first class of audio signals;
a first signal processing means comprising adapted filter coefficients configured
to filter the microphone signal based on the decorrelated audio signals to obtain
a noise compensated signal;
a second signal processing means configured to filter the noise compensated signal
by means of the same filter coefficients used by the first signal processing means
and on basis of the audio signals of the second class of audio signals to obtain
an output signal.
It may be preferred to use more than one microphone, in
particular, more than one microphone array. The first signal processing means comprises
filter coefficients that are automatically adapted time-dependently to model the
impulse response between the loudspeakers and the microphones (one impulse response
for each loudspeaker channel, see description above). The first and the second signal
processing means are adaptive filters that may or may not be physically separate
units.
The system may further comprise means configured to determine
the short-time power of the microphone signal, the noise compensated signal and
the output signal and the calculation/adaptation of the filter coefficients of the
first signal processing means may comprise multiplication of the filter coefficients
with a factor less than one, if the short-time power of the noise compensated signal
or of the output signal is higher than the short-time power of the microphone signal.
The pre-processing means may comprise a non-linear processing
means, in particular, a halfwave rectifier, and/or a time-dependent filtering means.
According to an embodiment of the inventive system, a beamforming
means configured to generate a beamformed microphone signal is further employed.
The pre-processing means and/or the first and/or the second
processing means can be configured to perform processing in the time domain or in
the frequency domain or in subbands.
Furthermore, the invention provides a communication system,
in particular, for use in a vehicular cabin, comprising
a system for signal processing of audio signals according to one of the above described
embodiments;
at least one first audio device, in particular, a radio or a CD player or a DVD
player, that transmits the audio signals of the first class; and
at least one second audio device, in particular, a speech output device, that transmits
the audio signals of the second class.
Embodiments of the inventive system for audio signal processing
are particularly useful for communication systems installed in vehicular cabins.
Feedback reduction of the verbal inputs by the passengers and dereverberation of,
e.g., audio signals from a radio device result in a better quality of audio signal
reproduction and intelligibility of verbal utterances.
Additional features and advantages of the invention will
be described with reference to the drawings:
- Fig. 1 shows a communication system for a vehicular cabin comprising microphones,
loudspeakers, a radio and signal processing means.
- Fig. 2 illustrates time-dependent filtering of radio signals in order to decorrelate
the signals.
- Fig. 3 illustrates an example of the inventive system and method for processing
audio signals comprising filtering of a microphone signal on the basis of decorrelated
audio signals from a first device and speech signals from a second device utilizing
the same filter coefficients.
- Fig. 4 illustrates an example of the inventive system and method comprising
filtering of a beamformed microphone signal on the basis of decorrelated audio signals
from a first device and speech signals from a second device utilizing the same filter
coefficients and a stability check.
Fig. 1 schematically illustrates an example of a communication system installed
in a vehicular cabin 1, comprising loudspeakers 2 and microphones 3 to detect verbal
utterances respectively mounted close to a driver 4, a front passenger 5 and back
passengers 6. The communication system also comprises an audio device 9, which according
to this example is a radio.
If a back passenger 6 is in dialog with the front passenger
5 the conversation is aided by the communication system by detecting the passengers'
utterances by means of the microphones 3 close to the passengers 5 or 6 respectively,
signal processing the microphone signals and outputting the processed signals to
the loudspeakers close to the passengers 4 or 5 respectively.
A signal processing means is logically and physically divided
into a means 7 processing signals detected by the backward microphones and a means
8 processing the signals detected by the front microphones. Alternatively, the signal
processing means can be formed as one single physical device. Both means 7 and 8
are also connected with the radio 9 and thus output of both the processed speech
signals and audio signals from the radio (radio signals) is provided by the means
7 and 8. Processed speech signals and audio signals from the radio are output concurrently
by the loudspeakers 2. Additional audio sources, e.g., a CD or a DVD player may
be present and represented by the reference number 9 and CDs or DVDs may be played-back
via the means 7 and 8.
According to the present invention, radio signals output
by the loudspeakers 2 are efficiently compensated, i.e. damped after they have been
received by the microphones 3 and input in the means 7 and 8. Feedback of speech
signals output by at least one of the loudspeakers 2 and input in the means 7 and/or
8 via microphones 3 is efficiently reduced.
Compensation of the radio signals, i.e. dereverberation,
and feedback reduction are performed by means of adaptive filters comprising sets
of filter coefficients for each of the loudspeakers 2.
For example, the microphones 3 detect radio signals output
by the loudspeakers 2. These detected signals are beamformed as known in the art
to obtain a beamformed signal d(n) (n denotes the discrete time index).
For the shown four loudspeakers 2 compensation could, in principle, be carried out
by the subtraction
$$e(n) = d(n) - \sum_{i=1}^{4} \sum_{k=0}^{N-1} \hat{h}_{i,k}(n)\, \tilde{x}_i(n-k)$$
where N is the filter length, i.e. the number of coefficients $\hat{h}_{i,k}(n)$,
e.g., some hundred, that constitute the estimated impulse responses $\hat{h}_i(n)$.
The estimated impulse responses $\hat{h}_i(n)$ model the real impulse responses
$h_i(n)$ of the system of the loudspeakers 2, the microphones 3 and the
vehicular cabin 1. By $\tilde{x}_i(n)$ the radio signals of four channels according
to the four loudspeakers are denoted. The difference e(n) is usually called the error signal.
However, since several kinds of radio signals show strong
correlations, in particular, in terms of cross correlation or coherence, numerical
algorithms used for the adaptation of the filter coefficients do not necessarily
converge to the desired impulse responses of the entire loudspeaker-room-microphone-system.
This problem is well-known in the art and mathematically speaking it is caused by
the non-uniqueness of the optimization problem to be solved for adaptation of the
filters.
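The non-uniqueness can be made concrete with a small Python sketch. It assumes, purely for illustration, two loudspeaker channels with two-tap impulse responses; all names and numerical values are hypothetical. When both channels carry the same signal, only the sum of the two impulse responses is observable at the microphone, so a wrong pair of estimates with the correct sum yields exactly the same microphone signal.

```python
def convolve_channel(h, x):
    """Return sum_k h[k] * x(n - k) for every n, with x(n) = 0 for n < 0."""
    return [
        sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
        for n in range(len(x))
    ]


def microphone_signal(responses, signals):
    """Superpose the contributions of all loudspeaker channels."""
    parts = [convolve_channel(h, x) for h, x in zip(responses, signals)]
    return [sum(values) for values in zip(*parts)]


def max_abs_difference(u, v):
    return max(abs(a - b) for a, b in zip(u, v))


true_h = [[0.5, 0.25], [0.25, -0.5]]   # hypothetical real impulse responses
wrong_h = [[0.75, -0.25], [0.0, 0.0]]  # different responses with the same sum

mono = [1.0, -2.0, 0.5, 3.0, -1.0, 0.25]
other = list(reversed(mono))           # stands in for a decorrelated channel

# Fully correlated channels: both coefficient sets explain d(n) equally well.
diff_correlated = max_abs_difference(
    microphone_signal(true_h, [mono, mono]),
    microphone_signal(wrong_h, [mono, mono]),
)

# Different channel signals: the wrong coefficient set no longer fits.
diff_decorrelated = max_abs_difference(
    microphone_signal(true_h, [mono, other]),
    microphone_signal(wrong_h, [mono, other]),
)
```

Here diff_correlated vanishes while diff_decorrelated does not, which is why an adaptation driven by strongly correlated channels cannot distinguish the real impulse responses from such alternatives.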
Decorrelation of the signals $\tilde{x}_i(n)$ can be achieved by some non-linear
mapping (see, e.g., Investigation of Several Types of Nonlinearities for Use in
Stereo Acoustic Echo Cancellation, by Morgan, D.R., Hall, J.L. and Benesty, J.,
IEEE Transactions on Speech and Audio Processing, Vol. 9, No. 6, p. 686, 2001),
e.g., half wave rectification, for each channel i:
$$x_i(n) = \begin{cases} (1 + a)\, \tilde{x}_i(n), & \text{if } \tilde{x}_i(n) > 0 \\ \tilde{x}_i(n), & \text{else} \end{cases}$$
Increasing the parameter a results in faster convergence
but also in a higher distortion factor (ripple-factor). Typically, a is chosen between
0.3 and 0.7.
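The half wave rectification may be sketched in Python as follows. The sketch assumes the common form of the mapping in which positive samples are amplified by the factor (1 + a) while all other samples pass unchanged; the function name halfwave_decorrelate and the default value a = 0.5 are chosen for illustration only.

```python
def halfwave_decorrelate(x_tilde, a=0.5):
    """Add a scaled half wave rectified copy of the signal to the signal itself.

    Positive samples become (1 + a) * x, all other samples are kept, which is
    equivalent to x + a * (x + |x|) / 2.
    """
    return [(1.0 + a) * value if value > 0 else value for value in x_tilde]


radio_channel = [1.0, -1.0, 0.0, 2.0]
decorrelated = halfwave_decorrelate(radio_channel, a=0.5)
```

A larger a makes the processed channel differ more strongly from the original one, at the price of a more audible distortion.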
Alternatively, the radio signals can be filtered by a time-dependent
filter before being output by the loudspeakers 2 (see, e.g.,
A Stereo Echo Canceller with Correct Echo-Path Identification Based on Input-Sliding
Technique, by Sugiyama, A., Joncour, Y. and Hirano, A., IEEE Transactions on Signal
Processing, Vol. 49, No. 11, p. 2577, 2001). A finite impulse response filter with
two coefficients c(n) and
1 - c(n) for each discrete time index n may be used. Signals
are passed without delay for some period, the coefficients being 1 and 0,
and then delay by one clock is realized by switching the coefficients to
0 and 1 (see Fig. 2). The switching process is preferably carried
out during a short interpolation phase rather than abruptly. Moreover, time-dependent
filters may be employed that only result in phase variations. Combinations of non-linear
processing and time-dependent filtering are possible.
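A minimal Python sketch of such a two-coefficient time-dependent filter is given below. The periodic schedule for c(n), i.e. the hold length, the linear interpolation and the switching back to the undelayed state, is an assumption made for illustration, as are the names sliding_coefficient and time_varying_filter.

```python
def sliding_coefficient(n, hold=8, ramp=4):
    """Hypothetical schedule for c(n).

    c(n) = 1 for `hold` samples (no delay), then a linear interpolation over
    `ramp` samples down to 0, c(n) = 0 for `hold` samples (delay by one
    sample), then a linear interpolation back to 1.
    """
    position = n % (2 * (hold + ramp))
    if position < hold:
        return 1.0
    if position < hold + ramp:
        return 1.0 - (position - hold + 1) / (ramp + 1)
    if position < 2 * hold + ramp:
        return 0.0
    return (position - 2 * hold - ramp + 1) / (ramp + 1)


def time_varying_filter(x, hold=8, ramp=4):
    """Compute y(n) = c(n) * x(n) + (1 - c(n)) * x(n - 1)."""
    output = []
    for n, sample in enumerate(x):
        c = sliding_coefficient(n, hold, ramp)
        previous = x[n - 1] if n > 0 else 0.0
        output.append(c * sample + (1.0 - c) * previous)
    return output


schedule = [sliding_coefficient(n, hold=2, ramp=1) for n in range(6)]
filtered = time_varying_filter([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], hold=2, ramp=1)
```

With hold=2 and ramp=1 the coefficient passes through the intermediate value 0.5 between the two states, so the output moves from the undelayed to the delayed signal without an abrupt jump.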
It should be noted that the identification of the real system
achieved by the decorrelation is necessary in order to use the estimated impulse
response to subsequently reduce feedback (see below). Thus, instead of
$\tilde{x}_i(n)$ the decorrelated audio signals $x_i(n)$ are used to adapt the
filter coefficients and to obtain the error signal e(n).
Further steps of the inventive method are explained with
reference to Fig. 3 that illustrates a method and a system for signal processing
that can be realized in the communication system shown in Fig. 1. As mentioned above,
radio signals $\tilde{x}_i(n)$ from a radio 10 (or signals from any other audio
device, e.g., a DVD player) are pre-processed by a non-linear pre-processing means
or a time-dependent filtering means 11 to obtain decorrelated signals $x_i(n)$
to be output by the loudspeaker 2. The signals output by the loudspeaker
2 are detected by the microphone 3 according to some real impulse response
$h_i(n)$. The index i enumerates the loudspeakers or loudspeaker
channels of the radio 10 respectively. Whereas, for simplicity, in this example
only one microphone 3 and one loudspeaker 2 are shown, in the following equations
the index i is still used to facilitate the understanding for the more general
case of multiple loudspeakers.
The microphone signal d(n) is processed by an adaptive
(radio compensation) filtering means 13 to obtain an error signal
$$e(n) = d(n) - \sum_{i=1}^{M} \hat{\mathbf{h}}_i^T(n)\, \mathbf{x}_i(n)$$
where
$$\hat{\mathbf{h}}_i(n) = [\hat{h}_{i,0}(n), \hat{h}_{i,1}(n), \ldots, \hat{h}_{i,N-1}(n)]^T$$
denotes the estimated impulse response of the i-th channel and
$$\mathbf{x}_i(n) = [x_i(n), x_i(n-1), \ldots, x_i(n-N+1)]^T$$
is the vector of the last N signals from the radio (the upper index T denotes
the transposition operation). The estimated impulse response, which is only one
single response, i.e. i=M=1, since only one loudspeaker (channel) is considered
in this example, is corrected in direction of the negative gradient of an appropriately
chosen cost function J(e(n)):
$$\hat{\mathbf{h}}_i(n+1) = \hat{\mathbf{h}}_i(n) - \mu\, \nabla_{\hat{\mathbf{h}}_i} J(e(n))$$
with a positive step size $\mu$. Adaptation can be realized using a normalized
least mean square method or a recursive least squares algorithm (for details see, e.g.,
Adaptive Filter Theory, by S. Haykin, Prentice Hall, New Jersey, 2002).
In the simplest case the cost function is given by $J(e(n)) = E\{e^2(n)\}$
with E being the expectation value.
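For a single loudspeaker channel, the adaptation may be sketched in Python by means of a normalized least mean square (NLMS) update, which is one of the options mentioned above. The function name nlms_identify, the step size 0.5, the regularization constant eps and the three-tap test response are assumptions chosen for illustration and are not taken from the described system.

```python
import random


def nlms_identify(x, d, num_taps, step=0.5, eps=1e-8):
    """Estimate an impulse response from loudspeaker signal x and microphone signal d.

    Per sample: e(n) = d(n) - h_hat^T x_vec(n), followed by the NLMS update
    h_hat <- h_hat + step * e(n) * x_vec(n) / (||x_vec(n)||^2 + eps).
    """
    h_hat = [0.0] * num_taps
    x_vec = [0.0] * num_taps  # holds x(n), x(n-1), ..., x(n-N+1)
    errors = []
    for x_n, d_n in zip(x, d):
        x_vec = [x_n] + x_vec[:-1]
        e_n = d_n - sum(h * v for h, v in zip(h_hat, x_vec))
        norm = sum(v * v for v in x_vec) + eps
        h_hat = [h + step * e_n * v / norm for h, v in zip(h_hat, x_vec)]
        errors.append(e_n)
    return h_hat, errors


# Synthetic test: white excitation through a hypothetical 3-tap response.
rng = random.Random(0)
x = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
true_h = [0.6, -0.3, 0.1]
d = [
    sum(true_h[k] * x[n - k] for k in range(len(true_h)) if n - k >= 0)
    for n in range(len(x))
]
h_hat, errors = nlms_identify(x, d, num_taps=3)
```

In this noise-free, single-channel setting the estimate approaches true_h and the error signal decays towards zero; with several strongly correlated channels the same update would suffer from the non-uniqueness discussed above.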
After the impulse response $\hat{h}_i(n)$ between the loudspeaker 2 and the
microphones 3 has been estimated, the
feedback components of an audio signal from a second audio device 15 can also be
estimated using the same estimated impulse response used for the compensation of
the radio signals by the means 13, i.e. a feedback reduction means 14 with filter
coefficients $\hat{\mathbf{g}}_i(n) = \hat{\mathbf{h}}_i(n)$ calculates
$$\hat{d}_{\tilde{e},i}(n) = \hat{\mathbf{g}}_i^T(n)\, \mathbf{y}_i(n)$$
from the vector $\mathbf{y}_i(n)$ that contains the last N output signals of the second audio
device, e.g., a speech output device, that is part of the communication system shown
in Fig. 1. Thus, the error signal e(n) is further enhanced by subtracting
the feedback components coming from the output signal from the speech output device
$y_i(n)$:
$$\tilde{e}(n) = e(n) - \sum_{i=1}^{M} \hat{d}_{\tilde{e},i}(n)$$
Whereas the radio signals $x_i(n)$ are usually output by all of the loudspeakers,
the output signals of the communication system $y_i(n)$ are usually output by the
loudspeaker(s) close to the listening communication partner only. The respective
output signals of the communication system $y_i(n)$ of the other channels are
set to zero.
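The two filtering steps with shared coefficients may be sketched in Python as follows. The sketch assumes that the estimated impulse responses are already adapted and, for the purpose of a check, equal to the real ones; it also treats the speech output signals as given sequences, whereas in the closed loop they depend on earlier output samples. All names and values are hypothetical.

```python
def regressor(signal, n, num_taps):
    """Return [s(n), s(n-1), ..., s(n-N+1)] with zeros before the start."""
    return [signal[n - k] if n - k >= 0 else 0.0 for k in range(num_taps)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def two_stage_filter(d, x_channels, y_channels, h_hat):
    """Compensate first-class signals, then reduce feedback with the same h_hat."""
    num_taps = len(h_hat[0])
    e, e_tilde = [], []
    for n in range(len(d)):
        # Stage 1: subtract estimated radio contributions (decorrelated signals x_i).
        radio = sum(dot(h, regressor(x, n, num_taps)) for h, x in zip(h_hat, x_channels))
        e_n = d[n] - radio
        # Stage 2: subtract estimated feedback of the speech output y_i, with g_i = h_i.
        feedback = sum(dot(h, regressor(y, n, num_taps)) for h, y in zip(h_hat, y_channels))
        e.append(e_n)
        e_tilde.append(e_n - feedback)
    return e, e_tilde


h = [[0.5, 0.25], [0.25, -0.5]]                          # hypothetical responses
x = [[1.0, -1.0, 2.0, 0.0], [0.5, 0.5, -1.0, 1.0]]       # radio on both channels
y = [[0.0, 1.0, -2.0, 1.0], [0.0, 0.0, 0.0, 0.0]]        # speech on channel 0 only
s = [0.25, -0.5, 0.75, 1.0]                              # local speech at the microphone

# Synthetic microphone signal: local speech plus everything the loudspeakers emit.
emitted = [[a + b for a, b in zip(xi, yi)] for xi, yi in zip(x, y)]
d = [
    s[n] + sum(dot(hi, regressor(ei, n, 2)) for hi, ei in zip(h, emitted))
    for n in range(len(s))
]
e, e_tilde = two_stage_filter(d, x, y, h)
```

With perfectly estimated responses the second-stage output e_tilde reproduces the local speech s, whereas the intermediate signal e still contains the feedback of the speech output on channel 0.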
Fig. 4 illustrates a generalized example for the herein
disclosed method and system. Microphone signals are obtained by three microphones
3 and they are pre-processed by a beamforming means 12 that combines the microphone
signals to one beamformed signal d(n) with an enhanced signal-to-noise ratio
that is further processed by an adaptive (radio compensation) filtering means 13
to obtain an error signal e(n). In this example the single loudspeaker 2
represents an arbitrary number of loudspeakers, i.e. M > 1. Processing
involves similar steps as described above with i = 1, ..., M.
However, according to the present example the stability
of the adaptation of the filter coefficients is checked 16. For this, the short-time
powers $p_d(n)$, $p_e(n)$ and $p_{\tilde{e}}(n)$ of the signals d(n), e(n) and
$\tilde{e}(n)$, respectively, are calculated, for example, by recursive filtering
of the squares of the respective signals:
$$p_d(n) = \lambda\, p_d(n-1) + (1 - \lambda)\, d^2(n)$$
$$p_e(n) = \lambda\, p_e(n-1) + (1 - \lambda)\, e^2(n)$$
$$p_{\tilde{e}}(n) = \lambda\, p_{\tilde{e}}(n-1) + (1 - \lambda)\, \tilde{e}^2(n)$$
where the time constant $\lambda$ is chosen to be between 0.95 and 0.999. The check
for stability is performed in two steps. First, $p_d(n)$ and $p_e(n)$ are compared
with each other. If the short-time power of the error signal
is slightly higher than the one of the beamformed microphone signal, say
$p_e(n) > K\, p_d(n)$ with the constant K to be chosen between 1 dB and 3 dB,
the estimated impulse response is weighted by a factor $\rho_e$ (leakage factor)
with $0 < \rho_e < 1$.
Second, $p_{\tilde{e}}(n)$ and $p_d(n)$ are compared and if
$p_{\tilde{e}}(n) > K\, p_d(n)$, the estimated impulse response is also weighted
by a factor $\rho_{\tilde{e}}$ with $0 < \rho_{\tilde{e}} < \rho_e < 1$.
The weighting of the estimated impulse response results in a sufficient
damping of the estimated impulse response within a few sampling intervals in the case
that an instability is excited.
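The stability check may be sketched in Python as follows. The sketch assumes a first-order recursive smoothing of the squared samples, interprets K as a power ratio given in dB (converted by 10^(K/10)), and uses hypothetical leakage factors 0.9 and 0.5 that merely satisfy the ordering of the two factors stated above; the function names are likewise chosen for illustration.

```python
def smooth_power(previous_power, sample, lam=0.99):
    """One step of the recursion p(n) = lam * p(n-1) + (1 - lam) * sample^2."""
    return lam * previous_power + (1.0 - lam) * sample * sample


def db_to_power_ratio(k_db):
    """Convert a threshold given in dB into a linear power ratio (assumption)."""
    return 10.0 ** (k_db / 10.0)


def apply_leakage(h_hat, p_d, p_e, p_e_tilde, k_db=3.0, rho_e=0.9, rho_e_tilde=0.5):
    """Weight the estimated impulse response if an output power exceeds K * p_d.

    First check: error signal e(n) against microphone signal d(n).
    Second check: output signal e~(n) against microphone signal d(n).
    """
    k = db_to_power_ratio(k_db)
    if p_e > k * p_d:
        h_hat = [rho_e * c for c in h_hat]
    if p_e_tilde > k * p_d:
        h_hat = [rho_e_tilde * c for c in h_hat]
    return h_hat


stable = apply_leakage([1.0, 2.0], p_d=1.0, p_e=0.5, p_e_tilde=0.5)
first_only = apply_leakage([1.0, 2.0], p_d=1.0, p_e=3.0, p_e_tilde=0.5)
both = apply_leakage([1.0, 2.0], p_d=1.0, p_e=3.0, p_e_tilde=3.0)
```

Because the weighting is applied at every sample for which a condition holds, the coefficients shrink geometrically while an instability persists and are left untouched otherwise.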
Whereas above the processing of audio signals is described
in the time domain as indicated by the discrete time index n, processing of the
digitized and Fourier transformed microphone signal or beamformed microphone signal
respectively in the frequency domain may be preferred.
The previously discussed examples are not intended as limitations
but serve as examples illustrating features and advantages of the invention. It
is to be understood that some or all of the above described features can also be
combined in different ways.