Mariella Baldussi, Editor
ISO/IEC 13818-3 International Standard 1995

DIGITAL AUDIO: Notes about MPEG 2

Annex D

(informative)

Psychoacoustic models

D.1 Psychoacoustic Model 1 for Lower Sampling Frequencies

The necessary adaptations to psychoacoustic model 1 for the extension to lower sampling frequencies are small. A description of that psychoacoustic model is repeated here, with the necessary changes.

The calculation of the psychoacoustic model has to be adapted to the corresponding layer. The example presented here is valid for Layers I and II. The model can be adapted to Layer III.

There is no difference in principle between the application of psychoacoustic model 1 to Layer I and to Layer II.

Layer I: A new bit allocation is calculated for each block of 12 subband samples, corresponding to 384 input PCM samples.
Layer II: A new bit allocation is calculated for three blocks totalling 36 subband samples, corresponding to 3*384 (1 152) input PCM samples.

The bit allocation of the 32 subbands is calculated on the basis of the signal-to-mask ratios of all the subbands. Therefore, it is necessary to determine, for each subband, the maximum signal level and the minimum masking threshold. The minimum masking threshold is derived from an FFT of the input PCM signal, followed by a psychoacoustic model calculation.

The FFT performed in parallel with the subband filter operation compensates for the lack of spectral selectivity obtained at low frequencies by the subband filterbank. This technique provides both a sufficient time resolution for the coded audio signal (Polyphase filter with optimised window for minimal pre-echoes) and a sufficient spectral resolution for the calculation of the masking thresholds. The frequencies and levels of aliasing distortions can be calculated. This is necessary for calculating a minimum bitrate for those subbands which need some bits to cancel the aliasing components in the decoder. The additional complexity to calculate the better frequency resolution is necessary only in the encoder, and introduces no additional delay or complexity in the decoder.

The calculation of the signal-to-mask-ratio is based on the following steps:

Step 1
- Calculation of the FFT for time to frequency conversion.

Step 2
- Determination of the sound pressure level in each subband.

Step 3
- Determination of the threshold in quiet (absolute threshold).

Step 4
- Finding of the tonal (more sinusoid-like) and non-tonal (more noise-like) components of the audio signal.

Step 5
- Decimation of the maskers, to obtain only the relevant maskers.

Step 6
- Calculation of the individual masking thresholds.

Step 7
- Determination of the global masking threshold.

Step 8
- Determination of the minimum masking threshold in each subband.

Step 9
- Calculation of the signal-to-mask ratio in each subband.

These steps are discussed in more detail below. A sampling frequency of 24 kHz is assumed, unless stated otherwise. For the other two sampling frequencies (22,05 kHz and 16 kHz), all frequencies mentioned should be scaled accordingly.

Step 1 Calculation of spectrum

The FFT is in principle the same as in ISO/IEC 11172-3, but due to the different sampling frequency the length when expressed in ms is different.

Technical data of the FFT:

	Layer I	Layer II
- Transform length N	512 samples	1024 samples
- Window size if Fs = 24 kHz	21,33 ms	42,67 ms
- Window size if Fs = 22,05 kHz	23,22 ms	46,44 ms
- Window size if Fs = 16 kHz	32 ms	64 ms
- Frequency resolution	Fs / 512	Fs / 1024
- Hann window, h(i):

	h(i) = sqrt(8/3) * 0,5 * {1 - cos[2 * pi * i / N]},	0 <= i <= N-1

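- Power density spectrum X(k):

	X(k) = 10 * log10 | (1/N) * SUM(l = 0 ... N-1) h(l) * s(l) * exp(-j * k * l * 2 * pi / N) |^2 dB,	k = 0 ... N/2

where s(l) is the input signal.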

A normalisation to the reference level of 96 dB SPL (Sound Pressure Level) has to be done in such a way that the maximum value corresponds to 96 dB.
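As an illustration of Step 1, the following is a minimal sketch rather than a reference implementation. It uses a direct DFT instead of an FFT so that it needs only the Python standard library. The function name `power_spectrum_db` is hypothetical, and a plain Hann window is assumed: any constant gain factor in the window cancels when the maximum is normalised to 96 dB.

```python
import cmath
import math

def power_spectrum_db(s):
    """Hann-windowed power spectrum in dB, with the maximum line set to 96 dB."""
    n = len(s)
    # Plain Hann window (assumption: constant gain factors are irrelevant here
    # because the spectrum is normalised to its maximum below).
    h = [0.5 * (1.0 - math.cos(2.0 * math.pi * i / n)) for i in range(n)]
    x = []
    for k in range(n // 2 + 1):
        acc = sum(h[i] * s[i] * cmath.exp(-2j * math.pi * k * i / n)
                  for i in range(n))
        power = abs(acc / n) ** 2
        x.append(10.0 * math.log10(power) if power > 0.0 else -math.inf)
    peak = max(x)
    if peak == -math.inf:          # all-zero input: nothing to normalise
        return x
    return [v + (96.0 - peak) for v in x]
```

For a sinusoid centred exactly on one spectral line, the two neighbouring lines come out about 6 dB below the peak, which is the expected main-lobe shape of a Hann window.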

Step 2 Determination of the sound pressure level

The sound pressure level Lsb(n) in subband n is computed by:

	Lsb(n) = MAX[ X(k), 20 * log10(scfmax(n) * 32768) - 10 ] dB,	X(k) in subband n

where X(k) is the sound pressure level of the spectral line with index k of the FFT with the maximum amplitude in the frequency range corresponding to subband n. The expression scfmax(n) is in Layer I the scalefactor, and in Layer II the maximum of the three scalefactors of subband n within a frame. The "-10 dB" term corrects for the difference between peak and RMS level. The sound pressure level Lsb(n) is computed for every subband n.
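A minimal sketch of this step follows. It assumes the form known from ISO/IEC 11172-3, Lsb(n) = MAX[X(k), 20*log10(scfmax(n)*32768) - 10], and assumes that subband n covers a fixed number of consecutive spectral lines (8 for Layer I, 16 for Layer II). The function and parameter names are hypothetical.

```python
import math

def subband_spl(x_db, scf_max, lines_per_subband):
    """Sound pressure level per subband from the spectrum and the scalefactors."""
    lsb = []
    for n, scf in enumerate(scf_max):
        lo = n * lines_per_subband
        # Largest spectral line inside subband n (assumed contiguous mapping).
        peak_line = max(x_db[lo:lo + lines_per_subband])
        # Scalefactor level; the -10 dB corrects peak level to RMS level.
        scf_level = 20.0 * math.log10(scf * 32768.0) - 10.0
        lsb.append(max(peak_line, scf_level))
    return lsb
```

With a scalefactor of 1,0 the second term evaluates to about 80,3 dB, so it dominates whenever the largest spectral line in the subband lies below that level.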

The following alternative method of calculating Lsb(n) offers a potential for better encoder performance, but this technique has not been subjected to a formal audio quality test.

The alternative sound pressure level in subband n is computed by:

	Lsb(n) = MAX[ Xspl(n), 20 * log10(scfmax(n) * 32768) - 10 ] dB

with

	Xspl(n) = 10 * log10( SUM(k in subband n) 10^(X(k)/10) ) dB

where Xspl(n) is the alternative sound pressure level corresponding to subband n.
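The alternative method can be sketched as follows, assuming (as in ISO/IEC 11172-3) that the spectral term is the power sum of all lines in the subband instead of the single largest line. The names `power_sum_db` and `subband_spl_alt` and the contiguous line-to-subband mapping are illustrative assumptions.

```python
import math

def power_sum_db(levels_db):
    """Add levels given in dB on a power scale and return the result in dB."""
    total = sum(10.0 ** (v / 10.0) for v in levels_db)
    return 10.0 * math.log10(total) if total > 0.0 else -math.inf

def subband_spl_alt(x_db, scf_max, lines_per_subband):
    """Alternative sound pressure level per subband (power sum of the lines)."""
    lsb = []
    for n, scf in enumerate(scf_max):
        lo = n * lines_per_subband
        x_spl = power_sum_db(x_db[lo:lo + lines_per_subband])
        scf_level = 20.0 * math.log10(scf * 32768.0) - 10.0
        lsb.append(max(x_spl, scf_level))
    return lsb
```

Two lines of equal level add up to about 3 dB more than either line alone, so this variant reports a higher level than the peak-line method for noise-like subbands.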

Step 3 Considering the threshold in quiet

The threshold in quiet LTq(k), also called absolute threshold, is available in the tables "Frequencies, critical band rates and absolute threshold" (tables D.1a, D.1b, D.1c for Layer I; tables D.1d, D.1e, D.1f for Layer II). These tables depend on the sampling rate of the input PCM signal. Values are available for each sample in the frequency domain where the masking threshold is calculated.

Step 4 Finding of tonal and non-tonal components

The tonality of a masking component has an influence on the masking threshold. For this reason, it is worthwhile to discriminate between tonal and non-tonal components. For calculating the global masking threshold, it is necessary to derive the tonal and the non-tonal components from the FFT spectrum.

This step starts with the determination of local maxima, then extracts tonal components (sinusoids) and calculates the intensity of the non-tonal components within a bandwidth of a critical band. The boundaries of the critical bands are given in the tables "Critical band boundaries" (tables D.2a, D.2b, D.2c for Layer I; tables D.2d, D.2e, D.2f for Layer II).

The bandwidth of the critical bands varies with the centre frequency, from only about 0,1 kHz at low frequencies to about 4 kHz at high frequencies. It is known from psychoacoustic experiments that the ear has a better frequency resolution in the lower than in the higher frequency region. To determine if a local maximum may be a tonal component, a frequency range df around the local maximum is examined. The frequency range df is given by:

Sampling rate: 16 kHz

df = 62,5 Hz	0 kHz	< f <=	3,0 kHz
df = 93,75 Hz	3,0 kHz	< f <=	6,0 kHz
df = 187,5 Hz	6,0 kHz	< f <=	7,5 kHz

Sampling rate: 22,05 kHz

df = 86,133 Hz	0 kHz	< f <=	2,756 kHz
df = 129,199 Hz	2,756 kHz	< f <=	5,512 kHz
df = 258,398 Hz	5,512 kHz	< f <=	10,336 kHz

Sampling rate: 24 kHz

df = 93,750 Hz	0 kHz	< f <=	3,0 kHz
df = 140,63 Hz	3,0 kHz	< f <=	6,0 kHz
df = 281,25 Hz	6,0 kHz	< f <=	11,250 kHz
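The df values listed above coincide with the largest neighbour offset |j| used in operation b) below, multiplied by the FFT line spacing Fs / N. The following sketch checks that arithmetic. The function name `search_width_hz` is hypothetical, and the relation is an observation about the listed numbers, not a definition taken from the standard.

```python
def search_width_hz(fs_hz, n_fft, j_max):
    """df as the largest offset |j| times the spectral line spacing Fs / N."""
    return j_max * fs_hz / n_fft

# Layer I (N = 512): the largest offsets are 2, 3 and 6 lines.
for fs in (16000, 22050, 24000):
    widths = [search_width_hz(fs, 512, j) for j in (2, 3, 6)]
    print(fs, [round(w, 3) for w in widths])
```

For Layer II (N = 1024) the line spacing is halved and the largest offsets are 4, 6 and 12 lines, so the same df values result.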

To make lists of the spectral lines X(k) that are tonal or non-tonal, the following three operations are performed:

a) Labelling of local maxima

A spectral line X(k) is labelled as a local maximum if

X(k) > X(k-1) and X(k) >= X(k+1)

b) Listing of tonal components and calculation of the sound pressure level

A local maximum is put in the list of tonal components if

X(k) - X(k+j) >= 7 dB,

where j is chosen according to

Layer I, Fs=16 kHz:

j = -2, +2	for	2 < k < 96
j = -3, -2, +2, +3	for	96 <= k < 192
j = -6, ..., -2, +2, ..., +6	for	192 <= k < 250

Layer II, Fs=16 kHz:

j = -4, +4	for	4 < k < 192
j = -6, ..., -2, +2, ..., +6	for	192 <= k < 384
j = -12, ..., -2, +2, ..., +12	for	384 <= k < 500

Layer I, Fs=22,05, 24 kHz:

j = -2, +2	for	2 < k < 64
j = -3, -2, +2, +3	for	64 <= k < 128
j = -6, ..., -2, +2, ..., +6	for	128 <= k < 250

Layer II, Fs=22,05, 24 kHz:

j = -4, +4	for	4 < k < 128
j = -6, ..., -2, +2, ..., +6	for	128 <= k < 256
j = -12, ..., -2, +2, ..., +12	for	256 <= k < 500

If X(k) is found to be a tonal component, then the following parameters are listed:

Index number k of the spectral line.
Sound pressure level Xtm(k) = 10 * log10( 10^(X(k-1)/10) + 10^(X(k)/10) + 10^(X(k+1)/10) ), in dB.
Tonal flag.

Next, all spectral lines within the examined frequency range are set to -∞ dB.
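Operations a) and b) can be sketched together as follows, here for Layer I at Fs = 22,05 or 24 kHz. The sketch reads the offsets j as symmetric negative and positive values. It also assumes, as in ISO/IEC 11172-3, that the level of a tonal component is the power sum of the peak line and its two direct neighbours, and that "the examined frequency range" means all lines up to the largest offset on either side. The function names are hypothetical.

```python
import math

def tonal_offsets(k):
    """Neighbour offsets j for Layer I at Fs = 22,05 kHz or 24 kHz."""
    if 2 < k < 64:
        return [-2, 2]
    if 64 <= k < 128:
        return [-3, -2, 2, 3]
    if 128 <= k < 250:
        return [-6, -5, -4, -3, -2, 2, 3, 4, 5, 6]
    return []

def extract_tonal(x_db):
    """Return (list of (k, level_db) tonal components, remaining spectrum)."""
    tonal = []
    for k in range(1, len(x_db) - 1):
        # a) local maximum test
        if not (x_db[k] > x_db[k - 1] and x_db[k] >= x_db[k + 1]):
            continue
        offsets = tonal_offsets(k)
        if not offsets or k + offsets[-1] >= len(x_db):
            continue
        # b) tonal if at least 7 dB above every examined neighbour
        if all(x_db[k] - x_db[k + j] >= 7.0 for j in offsets):
            power = sum(10.0 ** (x_db[k + d] / 10.0) for d in (-1, 0, 1))
            tonal.append((k, 10.0 * math.log10(power)))
    remaining = list(x_db)
    for k, _ in tonal:
        reach = tonal_offsets(k)[-1]
        for i in range(k - reach, k + reach + 1):
            remaining[i] = -math.inf   # removed from the non-tonal calculation
    return tonal, remaining
```

The tonal test runs on the unmodified spectrum and the lines are cleared only afterwards, so that removing one component cannot change the classification of a neighbouring maximum. This ordering is a design choice of the sketch.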

c) Listing of non-tonal components and calculation of the power

The non-tonal (noise) components are calculated from the remaining spectral lines. To calculate the non-tonal components from these spectral lines X(k), the critical bands z(k) are determined using the tables "Critical band boundaries" (tables D.2a, D.2b, D.2c for Layer I; tables D.2d, D.2e, D.2f for Layer II). 21 critical bands are used for the sampling rate of 16 kHz; 23 critical bands are used for 22,05 kHz and 24 kHz. Within each critical band, the powers of the spectral lines (remaining after the tonal components have been zeroed) are summed to form the sound pressure level of the new non-tonal component Xnm(k) corresponding to that critical band.

The following parameters are listed:

Index number k of the spectral line nearest to the geometric mean of the critical band.
Sound pressure level Xnm(k) in dB.
Non-tonal flag.
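A sketch of operation c) is given below. The band edges are passed in as a list of spectral-line indices. The values used in practice come from the "Critical band boundaries" tables, which are not reproduced here, so any edges used with this function are placeholders. Taking the geometric mean of the first and last line index of a band is one possible reading of "nearest to the geometric mean of the critical band". The function name is hypothetical.

```python
import math

def non_tonal_components(remaining_db, band_edges):
    """One (k, level_db) non-tonal component per critical band.

    band_edges: line indices; band b covers edges[b] .. edges[b+1] - 1
    (placeholder for the critical band boundary tables).
    """
    components = []
    for lo, hi in zip(band_edges, band_edges[1:]):
        power = sum(10.0 ** (remaining_db[k] / 10.0) for k in range(lo, hi))
        if power <= 0.0:
            continue                      # band is empty after tonal removal
        # Assumption: geometric mean of the first and last line of the band.
        k_centre = round(math.sqrt(max(lo, 1) * (hi - 1)))
        components.append((k_centre, 10.0 * math.log10(power)))
    return components
```

Lines that were set to -∞ dB when the tonal components were extracted contribute zero power, so they drop out of the sum without special handling.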

Step 5 Decimation of tonal and non-tonal masking components

Decimation is a procedure that is used to reduce the number of maskers which are considered for the calculation of the global masking threshold.

a) Tonal components Xtm(k) or non-tonal components Xnm(k) are considered for the calculation of the masking threshold only if:

	Xtm(k) >= LTq(k) or Xnm(k) >= LTq(k)

In this expression, LTq(k) is the absolute threshold (or threshold in quiet) at the frequency of index k. These values are given in tables D.1a, D.1b, D.1c for Layer I; tables D.1d, D.1e, D.1f for Layer II.

b) Decimation of two or more tonal components within a distance of less than 0,5 Bark: Keep the component with the highest power, and remove the smaller component(s) from the list of tonal components. For this operation, a sliding window in the critical band domain is used with a width of 0,5 Bark.

In the following, the index j is used to indicate the relevant tonal or non-tonal masking components from the combined decimated list.
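The decimation can be sketched as below. It assumes, as in ISO/IEC 11172-3, that the first operation discards components lying below the threshold in quiet. The lookups `ltq` and `bark` are hypothetical stand-ins for the absolute threshold and critical band rate columns of tables D.1, and the pairwise comparison of neighbouring components is one simple way to realise the 0,5 Bark sliding window.

```python
def decimate(tonal, non_tonal, ltq, bark):
    """tonal, non_tonal: lists of (k, level_db); ltq[k], bark[k]: lookups."""
    # Drop maskers that lie below the threshold in quiet.
    tonal = [(k, x) for k, x in tonal if x >= ltq[k]]
    non_tonal = [(k, x) for k, x in non_tonal if x >= ltq[k]]
    # Among tonal components closer than 0.5 Bark, keep the strongest.
    kept = []
    for k, x in sorted(tonal):
        if kept and bark[k] - bark[kept[-1][0]] < 0.5:
            if x > kept[-1][1]:
                kept[-1] = (k, x)
        else:
            kept.append((k, x))
    return kept, non_tonal
```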

Step 6 Calculation of individual masking thresholds

Of the original N/2 frequency domain samples, indexed by k, only a subset, indexed by i, is considered for the global masking threshold calculation. The samples used are shown in tables D.1a, D.1b, D.1c for Layer I; tables D.1d, D.1e, D.1f for Layer II.

Layer I:

For the frequency lines corresponding to the frequency region which is covered by the first six subbands no subsampling is used. For the frequency region corresponding to the next six subbands every second spectral line is considered. Finally, every fourth spectral line is considered for the next 18 subbands (see also tables D.1a, D.1b, D.1c for Layer I).

Layer II:

For the frequency lines corresponding to the frequency region which is covered by the first three subbands no subsampling is used. For the frequency region which is covered by the next three subbands every second spectral line is considered. For the frequency region corresponding to the next six subbands every fourth spectral line is considered. Finally, every eighth spectral line is considered for the next 18 subbands (see also tables D.1d, D.1e, D.1f for Layer II).

The number of samples, n, in the subsampled frequency domain depends on the layer. For Layer I, n equals 108, for Layer II, n equals 132.
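The sample counts 108 and 132 follow directly from the subsampling rules above, as this short sketch shows. It assumes N/2 spectral lines spread evenly over 32 subbands (8 lines per subband for Layer I, 16 for Layer II). The exact line indices are those of tables D.1; the starting index used here is an assumption, and only the counts are meaningful.

```python
def subsampled_indices(n_fft, plan, first=1):
    """plan: list of (number of subbands, step between retained lines)."""
    lines_per_subband = n_fft // 2 // 32
    indices, k = [], first
    for subbands, step in plan:
        end = k + subbands * lines_per_subband
        indices.extend(range(k, end, step))
        k = end
    return indices

LAYER_I = [(6, 1), (6, 2), (18, 4)]
LAYER_II = [(3, 1), (3, 2), (6, 4), (18, 8)]

print(len(subsampled_indices(512, LAYER_I)))    # 48 + 24 + 36
print(len(subsampled_indices(1024, LAYER_II)))  # 48 + 24 + 24 + 36
```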

Every tonal and non-tonal component is assigned the value of the index i that most closely corresponds to the frequency of the original spectral line X(k). This index i is given in tables D.1a, D.1b, D.1c for Layer I; tables D.1d, D.1e, D.1f for Layer II.

The individual masking thresholds of both tonal and non-tonal components are given by the following expressions:

	LTtm[z(j),z(i)] = Xtm[z(j)] + avtm[z(j)] + vf[z(j),z(i)] dB
	LTnm[z(j),z(i)] = Xnm[z(j)] + avnm[z(j)] + vf[z(j),z(i)] dB

In these formulas, LTtm and LTnm are the individual masking thresholds at critical band rate z in Bark of the masking component at the critical band rate zm of the masker in Bark. The values in dB can be either positive or negative. The term Xtm[z(j)] is the sound pressure level of the masking component with the index number j at the corresponding critical band rate z(j). The term av is called the masking index and vf the masking function of the masking component Xtm[z(j)]. The masking index av is different for tonal and non-tonal maskers (avtm and avnm).

For tonal maskers, it is given by

avtm = -1,525 - 0,275 * z(j) - 4,5 dB,

and for non-tonal maskers

avnm = -1,525 - 0,175 * z(j) - 0,5 dB.

The masking function vf of a masker is characterised by different lower and upper slopes, which depend on the distance in Bark dz = z(i) - z(j) to the masker. In this expression i is the index of the spectral line at which the masking function is calculated and j that of the masker. The critical band rates z(j) and z(i) can be found in tables D.1a, D.1b, D.1c for Layer I; tables D.1d, D.1e, D.1f for Layer II. The masking function, which is the same for tonal and non-tonal maskers, is given by:

vf = 17 * (dz + 1) - (0,4 * X[z(j)] + 6) dB	for -3 <= dz < -1 Bark
vf = (0,4 * X[z(j)] + 6) * dz dB	for -1 <= dz < 0 Bark
vf = -17 * dz dB	for 0 <= dz < 1 Bark
vf = -(dz - 1) * (17 - 0,15 * X[z(j)]) - 17 dB	for 1 <= dz < 8 Bark

In these expressions X[z(j)] is the sound pressure level of the jth masking component in dB. For reasons of implementation complexity, the masking is no longer considered if dz < -3 Bark, or dz >= 8 Bark (LTtm and LTnm are set to -∞ dB outside this range).
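The masking index and masking function can be sketched as follows. The signs are an assumption: the minus signs are restored so that the four segments of vf join continuously at dz = -1, 0 and 1, which matches the form used in ISO/IEC 11172-3. The function names are hypothetical.

```python
import math

def masking_index(z_j, tonal):
    """Masking index av in dB for a masker at critical band rate z_j (Bark)."""
    if tonal:
        return -1.525 - 0.275 * z_j - 4.5
    return -1.525 - 0.175 * z_j - 0.5

def masking_function(dz, x_j):
    """Masking function vf in dB; dz = z(i) - z(j), x_j = masker level in dB."""
    if -3.0 <= dz < -1.0:
        return 17.0 * (dz + 1.0) - (0.4 * x_j + 6.0)
    if -1.0 <= dz < 0.0:
        return (0.4 * x_j + 6.0) * dz
    if 0.0 <= dz < 1.0:
        return -17.0 * dz
    if 1.0 <= dz < 8.0:
        return -(dz - 1.0) * (17.0 - 0.15 * x_j) - 17.0
    return -math.inf                     # outside the range: no masking

def individual_threshold(x_j, z_j, z_i, tonal):
    """Individual masking threshold at z_i caused by the masker at z_j."""
    return x_j + masking_index(z_j, tonal) + masking_function(z_i - z_j, x_j)
```

For example, a 60 dB tonal masker at 10 Bark gives a threshold of 51,225 dB at its own position, while a non-tonal masker of the same level gives 56,225 dB. The masking index alone accounts for the difference.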

Step 7 Calculation of the global masking threshold LTg

The global masking threshold LTg(i) at the i-th frequency sample is derived from the upper and lower slopes of the individual masking thresholds of each of the j tonal and non-tonal maskers and from the threshold in quiet LTq(i). This is also given in tables D.1a, D.1b, D.1c for Layer I; tables D.1d, D.1e, D.1f for Layer II. The global masking threshold is found by summing the powers corresponding to the individual masking thresholds and the threshold in quiet:

	LTg(i) = 10 * log10( 10^(LTq(i)/10) + SUM(j = 1 ... m) 10^(LTtm[z(j),z(i)]/10) + SUM(j = 1 ... n) 10^(LTnm[z(j),z(i)]/10) ) dB

The total number of tonal maskers is given by m, and the total number of non-tonal maskers is given by n. For a given i, the range of j can be reduced to just encompass those masking components that are within -8 to +3 Bark from i. Outside of this range LTtm and LTnm are -∞ dB.
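The power summation described above can be sketched as follows, with all levels in dB. The function name and the flat list of individual thresholds (tonal and non-tonal together) are illustrative choices.

```python
import math

def global_threshold(ltq_i, individual_db):
    """Global masking threshold at one frequency sample i, in dB.

    ltq_i: threshold in quiet at sample i.
    individual_db: individual masking thresholds of all maskers at sample i.
    """
    total = 10.0 ** (ltq_i / 10.0)
    total += sum(10.0 ** (lt / 10.0) for lt in individual_db)
    return 10.0 * math.log10(total)
```

A masker whose individual threshold is -∞ dB at sample i contributes zero power, so maskers outside the considered Bark range can either be skipped or left in the list with the same result.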

Step 8 Determination of the minimum masking threshold

The minimum masking level LTmin(n) in subband n is determined by the following expression:

	LTmin(n) = MIN[ LTg(i) ] dB,	f(i) in subband n

where f(i) is the frequency of the i-th frequency sample. The f(i) are tabulated in tables D.1a, D.1b, D.1c for Layer I; tables D.1d, D.1e, D.1f for Layer II. A minimum masking level LTmin(n) is computed for every subband.

Step 9 Calculation of the signal-to-mask-ratio

The signal-to-mask ratio

	SMRsb(n) = Lsb(n) - LTmin(n) dB

is computed for every subband n.
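Steps 8 and 9 can be sketched together, assuming the signal-to-mask ratio is the subband sound pressure level minus the minimum global masking threshold found in that subband. The list `subband_of_sample`, which maps each frequency sample i to its subband, is a hypothetical stand-in for the f(i) column of tables D.1.

```python
def signal_to_mask(lsb, ltg, subband_of_sample):
    """Signal-to-mask ratio per subband, in dB.

    lsb[n]: sound pressure level of subband n.
    ltg[i]: global masking threshold at frequency sample i.
    subband_of_sample[i]: subband that sample i falls into.
    """
    lt_min = [float("inf")] * len(lsb)
    for threshold, n in zip(ltg, subband_of_sample):
        lt_min[n] = min(lt_min[n], threshold)   # Step 8: minimum per subband
    # Step 9: level minus minimum masking threshold
    return [level - floor for level, floor in zip(lsb, lt_min)]
```

The resulting ratios are the input to the bit allocation: a subband with a large positive ratio needs more bits to keep quantisation noise below the masking threshold.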

List of tables

Table D.1a. - Frequencies, critical band rates and absolute threshold
Table is valid for Layer I at a sampling rate of 16 kHz.
Table D.1b. - Frequencies, critical band rates and absolute threshold
Table is valid for Layer I at a sampling rate of 22,05 kHz.
Table D.1c. - Frequencies, critical band rates and absolute threshold
Table is valid for Layer I at a sampling rate of 24 kHz.
Table D.1d. - Frequencies, critical band rates and absolute threshold
Table is valid for Layer II at a sampling rate of 16 kHz.
Table D.1e. - Frequencies, critical band rates and absolute threshold
Table is valid for Layer II at a sampling rate of 22,05 kHz.
Table D.1f. - Frequencies, critical band rates and absolute threshold
Table is valid for Layer II at a sampling rate of 24 kHz.
Table D.2a. - Critical band boundaries
This table is valid for Layer I at a sampling rate of 16 kHz.
Table D.2b. - Critical band boundaries
This table is valid for Layer I at a sampling rate of 22,05 kHz.
Table D.2c. - Critical band boundaries
This table is valid for Layer I at a sampling rate of 24 kHz.
Table D.2d. - Critical band boundaries
This table is valid for Layer II at a sampling rate of 16 kHz.
Table D.2e. - Critical band boundaries
This table is valid for Layer II at a sampling rate of 22,05 kHz.
Table D.2f. - Critical band boundaries
This table is valid for Layer II at a sampling rate of 24 kHz.

D.2 Psychoacoustic Model 2 for Lower Sampling Frequencies

Psychoacoustic model 2 for lower sampling frequencies is identical to the psychoacoustic model 2 as described in ISO/IEC 11172-3, with some exceptions. The following tables are used instead of tables C.7.a ... C.8.e, for use with Layer III:

List of tables:

Table D.3.a -- Sampling_frequency = 24 kHz long blocks
Table D.3.b -- Sampling_frequency = 22,05 kHz long blocks
Table D.3.c -- Sampling_frequency = 16 kHz long blocks
Table D.3.d -- Sampling_frequency = 24 kHz short blocks
Table D.3.e -- Sampling_frequency = 22,05 kHz short blocks
Table D.3.f -- Sampling_frequency = 16 kHz short blocks

Table D.4 -- Tables for converting threshold calculation partitions to scalefactor bands

Table D.4.a -- Sampling_frequency = 24 kHz long blocks
Table D.4.b -- Sampling_frequency = 22,05 kHz long blocks
Table D.4.c -- Sampling_frequency = 16 kHz long blocks
Table D.4.d -- Sampling_frequency = 24 kHz short blocks
Table D.4.e -- Sampling_frequency = 22,05 kHz short blocks
Table D.4.f -- Sampling_frequency = 16 kHz short blocks
