Consider a setting where we are given some modified short-time-Fourier-transform (STFT) or modified spectrogram (magnitude of STFT). The STFT/spectrogram is called modified to stress the fact that it in general does not corresponds to any valid time-domain signal: it may be obtained by modifying a STFT/spectrogram of some real signal, or it may simply be generated by a neural network. The task here is to synthesize a time-domain signal from this target STFT/spectrogram, such that the STFT/spectrogram of the synthesized signal is as close to the target as possible.

Let us denote the target STFT as $$Y_w[mS,k]$$, where $$m$$ is the index of the time-window, and $$k$$ represents the frequency axis. Here $$S$$ denote the step size of the time-domain window, and $$mS$$ can be viewed as the center of the sliding time windows. Correspondingly, if we are talking about spectrogram, the target can be denoted as $$\|Y_w(mS,\omega)\|$$, to emphasis that there is no phase information.

First, let’s align the definition and notation of the STFT/spectrogram. For a given time-domain signal $$x[n]$$, it is masked by a finite length window function $$w$$, yielding $$x_w[mS,n]\triangleq w[n-mS]x[n]$$, with $$mS$$ denoting the offset of the window from the origin, and $$m$$ being the index of time-domain slice. The STFT $$X_w[mS,k]$$ is obtained by taking the DFT of each windowed signal $$x_w[mS, n]$$ with respect to time-index $$n$$:

\begin{align*} X_w[mS,k]=\text{DFT}(x_w[mS, n]). \end{align*}

The objective is then to find a time domain signal $$x[n]$$ such that its STFT $$X_w[mS,k]$$ is as close to the target $$Y_w[mS,k]$$ as possible. Note, again, that the target $$Y_w[mS,k]$$ in general does not correspond to any valid time-domain signal. In other words, there is no time-domain signal that produces $$Y_w[mS,k]$$ as its STFT. Going back to the objective, a natural way to define the closeness is simply the sum squared distance, expressed as below

\begin{align*} D\left(X_w, Y_w\right) = \frac{1}{N}\sum_{m=-\infty}^{\infty}\sum_{k=0}^{N-1}|X_w[mS,k]-Y_w[mS,k]|^2\notag, \end{align*}

where $$N$$ denote the window-length (and correspondingly the DFT size). By Parseval’s theorem we have

\begin{align*} D\left(X_w, Y_w\right) =&\sum_{m=-\infty}^{\infty}\sum_{l=-\infty}^{\infty}|x_w[mS,l]-y_w[mS,l]|^2\notag\\ =&\sum_{m=-\infty}^{\infty}\sum_{l=-\infty}^{\infty}|w[n-mS]x[n]-y_w[mS,l]|^2\notag, \end{align*}

where $$y_w[mS,l]$$ is the IDFT of $$Y_w[mS,k]$$.

This is just the least-square problem, where the solution of $$x[n]$$ can be obtained by taking the derivative with respect to $$D\left(X_w, Y_w\right)$$ and enforce it to be zero, yielding the following solution:

\begin{align*} x[n]=\frac{ \sum_{m=-\infty}^\infty w[n-Sm]y_w[mS,n] }{ \sum_{m=-\infty}^\infty w^2[n-Sm] }\notag, \end{align*}

With carefully designed window function, we can make the denominator to be $$1$$, resulting in a form where the reconstructed signal $$x$$ is simply the sum of the IDFT of the target STFT, weighted by the window function $$w$$. Two such window functions are: (1) Rectangular window $$w_r$$ with length $$L$$ being an integer multiple of the sliding step-size $$S$$ (2) Sinusoidal window

\begin{align*} w_s[n] = \frac{2w_r[n]}{\sqrt{4a^2+2b^2}}\left[a+b\cos\left(\frac{2\pi n}{L}+\phi\right)\right]\notag. \end{align*}

with length $$L$$ being a multiple of four times the window shift $$S$$.

Now that we have introduced the reconstruction of modified STFT with a least-square error estimate (LSEE) criterion, let us look at the corresponding reconstruction of the modified spectrogram. Without phase information, the least-square optimization problem is modified to

\begin{align*} D\left(X_w, Y_w\right) = \frac{1}{N}\sum_{m=-\infty}^{\infty}\sum_{k=0}^{N-1}\left(|X_w[mS,k]|-|Y_w[mS,k]|\right)^2\notag, \end{align*}

It can be showed that the following iterative algorithm reduces the error function above after each iteration:

1. Start with initial estimate of $$x^{(t)}[n]$$

2. Obtain its STFT $$X_w^{(t)}[mS, k]$$

3. Replace the magnitude of $$X_w^{(t)}[mS, k]$$ with the target spectrogram $$\|Y_w[mS, k]\|$$ and preserve the phase. More precisely, obtain $$\widehat{X}_w^{(t)}[mS, k]$$

\begin{align*} \widehat{X}_w^{(t)}[mS, k]=|Y_w[mS, k]|\frac{X_w^{(t)}[mS, k]}{|X_w^{(t)}[mS, k]|}\notag. \end{align*}
1. Get an updated estimate of $$x^{(t+1)}[n]$$ by running the LSEE algorithm with target STFT of $$\widehat{X}_w^{(t)}[mS, k]$$
\begin{align*} x^{(t+1)}[n]=\frac{ \sum_{m=-\infty}^\infty w[n-Sm]\text{IDFT}\left(\widehat{X}_w^{(t)}[mS, k]\right) }{ \sum_{m=-\infty}^\infty w^2[n-Sm] }\notag, \end{align*}

and go back to step 1.