Consider a setting where we are given some modified short-time-Fourier-transform (STFT) or modified spectrogram (magnitude of STFT). The STFT/spectrogram is called modified to stress the fact that it in general does not corresponds to any valid time-domain signal: it may be obtained by modifying a STFT/spectrogram of some real signal, or it may simply be generated by a neural network. The task here is to synthesize a time-domain signal from this target STFT/spectrogram, such that the STFT/spectrogram of the synthesized signal is as close to the target as possible.

Let us denote the target STFT as \(Y_w[mS,k]\), where \(m\) is the index of the time-window, and \(k\) represents the frequency axis. Here \(S\) denote the step size of the time-domain window, and \(mS\) can be viewed as the center of the sliding time windows. Correspondingly, if we are talking about spectrogram, the target can be denoted as \(\|Y_w(mS,\omega)\|\), to emphasis that there is no phase information.

First, let’s align the definition and notation of the STFT/spectrogram. For a given time-domain signal \(x[n]\), it is masked by a finite length window function \(w\), yielding \(x_w[mS,n]\triangleq w[n-mS]x[n]\), with \(mS\) denoting the offset of the window from the origin, and \(m\) being the index of time-domain slice. The STFT \(X_w[mS,k]\) is obtained by taking the DFT of each windowed signal \(x_w[mS, n]\) with respect to time-index \(n\):

\[\begin{align*} X_w[mS,k]=\text{DFT}(x_w[mS, n]). \end{align*}\]

The objective is then to find a time domain signal \(x[n]\) such that its STFT \(X_w[mS,k]\) is as close to the target \(Y_w[mS,k]\) as possible. Note, again, that the target \(Y_w[mS,k]\) in general does not correspond to any valid time-domain signal. In other words, there is no time-domain signal that produces \(Y_w[mS,k]\) as its STFT. Going back to the objective, a natural way to define the closeness is simply the sum squared distance, expressed as below

\[\begin{align*} D\left(X_w, Y_w\right) = \frac{1}{N}\sum_{m=-\infty}^{\infty}\sum_{k=0}^{N-1}|X_w[mS,k]-Y_w[mS,k]|^2\notag, \end{align*}\]

where \(N\) denote the window-length (and correspondingly the DFT size). By Parseval’s theorem we have

\[\begin{align*} D\left(X_w, Y_w\right) =&\sum_{m=-\infty}^{\infty}\sum_{l=-\infty}^{\infty}|x_w[mS,l]-y_w[mS,l]|^2\notag\\ =&\sum_{m=-\infty}^{\infty}\sum_{l=-\infty}^{\infty}|w[n-mS]x[n]-y_w[mS,l]|^2\notag, \end{align*}\]

where \(y_w[mS,l]\) is the IDFT of \(Y_w[mS,k]\).

This is just the least-square problem, where the solution of \(x[n]\) can be obtained by taking the derivative with respect to \(D\left(X_w, Y_w\right)\) and enforce it to be zero, yielding the following solution:

\[\begin{align*} x[n]=\frac{ \sum_{m=-\infty}^\infty w[n-Sm]y_w[mS,n] }{ \sum_{m=-\infty}^\infty w^2[n-Sm] }\notag, \end{align*}\]

With carefully designed window function, we can make the denominator to be \(1\), resulting in a form where the reconstructed signal \(x\) is simply the sum of the IDFT of the target STFT, weighted by the window function \(w\). Two such window functions are: (1) Rectangular window \(w_r\) with length \(L\) being an integer multiple of the sliding step-size \(S\) (2) Sinusoidal window

\[\begin{align*} w_s[n] = \frac{2w_r[n]}{\sqrt{4a^2+2b^2}}\left[a+b\cos\left(\frac{2\pi n}{L}+\phi\right)\right]\notag. \end{align*}\]

with length \(L\) being a multiple of four times the window shift \(S\).

Now that we have introduced the reconstruction of modified STFT with a least-square error estimate (LSEE) criterion, let us look at the corresponding reconstruction of the modified spectrogram. Without phase information, the least-square optimization problem is modified to

\[\begin{align*} D\left(X_w, Y_w\right) = \frac{1}{N}\sum_{m=-\infty}^{\infty}\sum_{k=0}^{N-1}\left(|X_w[mS,k]|-|Y_w[mS,k]|\right)^2\notag, \end{align*}\]

It can be showed that the following iterative algorithm reduces the error function above after each iteration:

  1. Start with initial estimate of \(x^{(t)}[n]\)

  2. Obtain its STFT \(X_w^{(t)}[mS, k]\)

  3. Replace the magnitude of \(X_w^{(t)}[mS, k]\) with the target spectrogram \(\|Y_w[mS, k]\|\) and preserve the phase. More precisely, obtain \(\widehat{X}_w^{(t)}[mS, k]\)

\[\begin{align*} \widehat{X}_w^{(t)}[mS, k]=|Y_w[mS, k]|\frac{X_w^{(t)}[mS, k]}{|X_w^{(t)}[mS, k]|}\notag. \end{align*}\]
  1. Get an updated estimate of \(x^{(t+1)}[n]\) by running the LSEE algorithm with target STFT of \(\widehat{X}_w^{(t)}[mS, k]\)
\[\begin{align*} x^{(t+1)}[n]=\frac{ \sum_{m=-\infty}^\infty w[n-Sm]\text{IDFT}\left(\widehat{X}_w^{(t)}[mS, k]\right) }{ \sum_{m=-\infty}^\infty w^2[n-Sm] }\notag, \end{align*}\]

and go back to step 1.