GriffinLim algorithm for waveform reconstruction
Consider a setting where we are given some modified shorttimeFouriertransform (STFT) or modified spectrogram (magnitude of STFT). The STFT/spectrogram is called modified to stress the fact that it in general does not corresponds to any valid timedomain signal: it may be obtained by modifying a STFT/spectrogram of some real signal, or it may simply be generated by a neural network. The task here is to synthesize a timedomain signal from this target STFT/spectrogram, such that the STFT/spectrogram of the synthesized signal is as close to the target as possible.
Let us denote the target STFT as \(Y_w[mS,k]\), where \(m\) is the index of the timewindow, and \(k\) represents the frequency axis. Here \(S\) denote the step size of the timedomain window, and \(mS\) can be viewed as the center of the sliding time windows. Correspondingly, if we are talking about spectrogram, the target can be denoted as \(\Y_w(mS,\omega)\\), to emphasis that there is no phase information.
First, let’s align the definition and notation of the STFT/spectrogram. For a given timedomain signal \(x[n]\), it is masked by a finite length window function \(w\), yielding \(x_w[mS,n]\triangleq w[nmS]x[n]\), with \(mS\) denoting the offset of the window from the origin, and \(m\) being the index of timedomain slice. The STFT \(X_w[mS,k]\) is obtained by taking the DFT of each windowed signal \(x_w[mS, n]\) with respect to timeindex \(n\):
\[\begin{align*} X_w[mS,k]=\text{DFT}(x_w[mS, n]). \end{align*}\]The objective is then to find a time domain signal \(x[n]\) such that its STFT \(X_w[mS,k]\) is as close to the target \(Y_w[mS,k]\) as possible. Note, again, that the target \(Y_w[mS,k]\) in general does not correspond to any valid timedomain signal. In other words, there is no timedomain signal that produces \(Y_w[mS,k]\) as its STFT. Going back to the objective, a natural way to define the closeness is simply the sum squared distance, expressed as below
\[\begin{align*} D\left(X_w, Y_w\right) = \frac{1}{N}\sum_{m=\infty}^{\infty}\sum_{k=0}^{N1}X_w[mS,k]Y_w[mS,k]^2\notag, \end{align*}\]where \(N\) denote the windowlength (and correspondingly the DFT size). By Parseval’s theorem we have
\[\begin{align*} D\left(X_w, Y_w\right) =&\sum_{m=\infty}^{\infty}\sum_{l=\infty}^{\infty}x_w[mS,l]y_w[mS,l]^2\notag\\ =&\sum_{m=\infty}^{\infty}\sum_{l=\infty}^{\infty}w[nmS]x[n]y_w[mS,l]^2\notag, \end{align*}\]where \(y_w[mS,l]\) is the IDFT of \(Y_w[mS,k]\).
This is just the leastsquare problem, where the solution of \(x[n]\) can be obtained by taking the derivative with respect to \(D\left(X_w, Y_w\right)\) and enforce it to be zero, yielding the following solution:
\[\begin{align*} x[n]=\frac{ \sum_{m=\infty}^\infty w[nSm]y_w[mS,n] }{ \sum_{m=\infty}^\infty w^2[nSm] }\notag, \end{align*}\]With carefully designed window function, we can make the denominator to be \(1\), resulting in a form where the reconstructed signal \(x\) is simply the sum of the IDFT of the target STFT, weighted by the window function \(w\). Two such window functions are: (1) Rectangular window \(w_r\) with length \(L\) being an integer multiple of the sliding stepsize \(S\) (2) Sinusoidal window
\[\begin{align*} w_s[n] = \frac{2w_r[n]}{\sqrt{4a^2+2b^2}}\left[a+b\cos\left(\frac{2\pi n}{L}+\phi\right)\right]\notag. \end{align*}\]with length \(L\) being a multiple of four times the window shift \(S\).
Now that we have introduced the reconstruction of modified STFT with a leastsquare error estimate (LSEE) criterion, let us look at the corresponding reconstruction of the modified spectrogram. Without phase information, the leastsquare optimization problem is modified to
\[\begin{align*} D\left(X_w, Y_w\right) = \frac{1}{N}\sum_{m=\infty}^{\infty}\sum_{k=0}^{N1}\left(X_w[mS,k]Y_w[mS,k]\right)^2\notag, \end{align*}\]It can be showed that the following iterative algorithm reduces the error function above after each iteration:

Start with initial estimate of \(x^{(t)}[n]\)

Obtain its STFT \(X_w^{(t)}[mS, k]\)

Replace the magnitude of \(X_w^{(t)}[mS, k]\) with the target spectrogram \(\Y_w[mS, k]\\) and preserve the phase. More precisely, obtain \(\widehat{X}_w^{(t)}[mS, k]\)
 Get an updated estimate of \(x^{(t+1)}[n]\) by running the LSEE algorithm with target STFT of \(\widehat{X}_w^{(t)}[mS, k]\)
and go back to step 1.