Skip to content

Latest commit

 

History

History
47 lines (27 loc) · 4.55 KB

README.md

File metadata and controls

47 lines (27 loc) · 4.55 KB

STeAMed haM-ifiER (STAMMER)

If you re-arranged the frames of Steamed Hams in the right order, could you produce the FitnessGram Pacer Test?

fitnessgram_pacer_test.mp4

STAMMER is a python utility for recreating any audio source with the frames of any other audio or video source. For each frame in the audio source being recreated, STAMMER selects the frame in the source audio with the most similar spectrum, adjusts its volume as needed, and inserts that frame into the output. Despite its name, STAMMER works on any audio, not just Steamed Hams.

Instructions

STAMMER requires python 3, and the following python libraries:

numpy
scipy
Pillow

These can be installed by running pip install -r requirements.txt.

It also requires the command line utility ffmpeg.

STAMMER can be run from the command line with the command

python stammer.py <carrier track> <modulator track> <output file>

where <carrier track> is the path to an audio or video file that frames will be taken from (i.e. Steamed Hams in the above example), <modulator track> is the path to an audio or video file that will be reconstructed using the carrier track, and <output file> is a path to file that will be written to. <output file> should have an audio or video file extension (such as .wav, .mp3, .mp4, etc).

Why and How

For several years, making increasingly outlandish edits of the skit "Steamed Hams" from The Simpsons has been a trend online. This project was primarily inspired by an edit of "Steamed Hams" in which the frames were sorted by pitch. It really is worth a listen: the choppy untintelligible vocals, steadily rising in pitch, builds tension in a strange way. Even chopped to the granularity of a frame, the music stabs and dialogue from the skit is still recognizable.

I've messed around with audio in numpy enough that this got me thinking: what else could I do with these frames? The idea of turning "Steamed Hams" into another song by manipulating its frequencies isn't completely novel: other videos have done it using a vocoder. Roughly speaking vocoder divides the carrier signal (the other song) into frequency bands, and then sets the amplitude of each band to the amplitude of the modulator signal ("Steamed Hams") at that point in time. The result is an amalgamation of the two signals, where the pitch of the carrier is "sung" through the timbre of the modulator.

Like the vocoder, STAMMER divides the carrier and modulator signal into frequency bands, and calculates the amplitude of each band. The algorithm is as follows:

Let the carrier $C$ be broken into frames $C_i, 0 \leq i &lt; n_c$ for each of its frames, and the modulator $M$ be broken into frames $M_i, 0 \leq i &lt; n_m$ for each of the frames. We replace each frame in $M$ with the frame of $C$ where the spectrum $c_i$ of $C_i$ is most similar to the spectrum $m_i$ of $M_i$.

The spectrum can be calculated from the frame using the Fast Fourier Transform. This algorithm returns a spectrum with far higher resolution than we need for these purposes, especially at high frequencies. To better reflect the frequencies that the human ear is sensitive to, we average the spectra of nearby frequencies into "bins," using larger bins for higher frequencies.

To give the algorithm more options, for a cleaner-sounding result, I decided to allow STAMMER to change the volume of carrier frames to match the volume of modulator frames. This isn't quite as involved as in a vocoder, where we change the volume of the frequencies in each bin independently. Instead, we are merely linearly scaling the spectrum vector.

After normalizing, we want to minimize the angle between the two vectors, which maximizes their similarity. In other words, we find $c_j$ that maximizes the dot product $\frac{c_j}{|c_j|}\cdot\frac{m_i}{|m_i|}$. For future work, it may be worth looking at Accuracy limitations and the measurement of errors in the stochastic simulation of chemically reacting systems for more advanced ways of comparing histograms than merely averaging nearby frequency lines into bins and calculating the dot products.