Android-Java: The Spectrogram of speech using libgdx

In this post we will take a look at calculating the log magnitude spectrum of a speech signal in Java/Android. Plotted across time and frequency this is often called a spectrogram, and it lets us observe the phonetic structure of speech. In this example I will only calculate the log spectrum and print it to the screen; you can then use this information to draw the spectrogram image, or use the energy information for other purposes such as a VU meter or short-time log magnitude analysis. Just keep in mind this is an offline approach, whereby the FFT of the entire signal is stored in one big matrix (by no means the most memory-efficient method), but it is useful nonetheless.

[Image: Spectrogram of speech by a male speaker]

The frequency domain is calculated using Badlogic’s implementation of the Fast Fourier Transform algorithm from their game library, libgdx. Before you start, be sure to download the libgdx jar library files. Here is a good resource for learning how to install libgdx in Eclipse, along with some great tutorials on where to begin.
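Before diving into the full example, here is a minimal sketch of a single forward transform with the libgdx FFT class, just to show the shape of the API we rely on below (the import path is from the older libgdx builds that bundle the audio analysis classes, so check it against your jar):

	import com.badlogic.gdx.audio.analysis.FFT;

	public class FFTDemo {
		public static void main(String[] args) {
			int n = 512;        // FFT size, must be a power of two
			float fs = 8000f;   // sampling frequency in Hz
			float[] frame = new float[n];

			// Fill the frame with a 1 kHz test tone
			for (int i = 0; i < n; i++) {
				frame[i] = (float) Math.sin(2 * Math.PI * 1000 * i / fs);
			}

			FFT fft = new FFT(n, fs); // same constructor used later in this post
			fft.forward(frame);       // forward transform of one frame

			float[] re = fft.getRealPart();
			float[] im = fft.getImaginaryPart();

			// Print the magnitude of each bin; the peak should land
			// near bin 1000/(fs/n) = 64
			for (int k = 0; k < n / 2; k++) {
				double mag = Math.sqrt(re[k] * re[k] + im[k] * im[k]);
				System.out.println("bin " + k + ": " + mag);
			}
		}
	}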

Some more preliminary information: real-world acoustic sound such as music, voice and environmental noise is considered to be non-stationary and random. This basically means that the statistics of the sound’s amplitude (mean, variance and standard deviation) change so much that there is no sign of periodicity. So if you were to take one large periodogram spectrum estimate over the entire recording, it would result in a biased frequency spectrum, and you would also need one very, very large N-point FFT. To analyse an acoustic signal accurately, we instead divide it up into small overlapping frames (for speech it is common to use 20-40 ms frames) and take the FFT of each frame; this is the short-time Fourier transform. Frame by frame, at least some of the signal’s amplitude statistics then stay consistent, which is known as quasi-stationarity. In this example we use a 32 ms frame size and a 4 ms frame shift (87.5% overlap). The short-time Fourier transform is given by:

$latex X(m,k)=\sum^{\infty}_{n=-\infty}x(n)\,w(n-m)\,e^{-j2\pi kn/N}$

where n is the discrete-time index, k is the index of the discrete frequency, N is the frame duration (in samples), m is the frame index, and w(n) is the analysis window function. w(n) is used to reduce spectral leakage between frames; in this example a tapered window, the Hamming window, was used. The Hamming window is given by [1]:

$latex w(n) = \begin{cases} 0.54-0.46\cos\big(\frac{2\pi n}{N-1}\big), & 0 \leq n \leq N-1, \\ 0, & \textit{otherwise} \end{cases}$
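To make the frame arithmetic concrete, here is a quick sanity check of the sample counts used throughout this post, assuming the 8 kHz sampling rate of the example file:

	public class FrameMath {
		public static void main(String[] args) {
			int fs = 8000;   // sampling frequency in Hz
			int tlen = 32;   // frame length in ms
			int tshift = 4;  // frame shift in ms

			int nlen = tlen * fs / 1000;     // 256 samples per frame
			int nshift = tshift * fs / 1000; // 32 samples per shift

			// Overlap between consecutive frames: (256 - 32) / 256 = 87.5%
			float overlap = 100f * (nlen - nshift) / nlen;

			// Number of frames for a 2 second (16000 sample) signal,
			// using the 1 + ceil((L - N) / S) formula from the code below
			int L = 2 * fs;
			int nsegs = 1 + (int) Math.ceil((double) (L - nlen) / nshift);

			System.out.println(nlen + " samples/frame, " + nshift
					+ " samples/shift, " + overlap + "% overlap, "
					+ nsegs + " frames");
		}
	}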

OK, now that the theory is out of the way: in this example we read audio from a file stored in the asset folder pre-packaged with our app. I deal with reading the wav file in an external class called WaveTools, which won’t be discussed here; check out the class in the source if you are curious.

You can download the source of this project here.

First, initialise all variables:

	float[] array_hat, res = null;
	float[] fmag = null;
	float[] flogmag = null;
	float[] fft_cpx, tmpr, tmpi; // FFT output buffers
	float[] mod_spec = null;
	float[] real_mod = null; // for optional reconstruction
	float[] imag_mod = null;
	double[] real = null; // real part of the spectrum
	double[] imag = null; // imaginary part of the spectrum
	double[] mag = null; // magnitude spectrum
	double[] phase = null; // phase spectrum
	double[] logmag = null; // log magnitude spectrum (dB)
	static float[][] framed; // audio cut into overlapping frames
	static int n, seg_len, n_shift; // frame length and shift in samples
	static float n_segs; // total number of frames
	float[] time_array;
	float[] array; // zero-padded FFT input buffer
	float[] wn; // analysis window (Hamming)
	double[] nmag;
	static float[][] spec; // [seg_len][n_segs] log spectrogram matrix
	float[] array2;
	static float max; // max/min of the framed signal
	static float min;
	static float smax; // max/min of the log spectrum
	static float smin;
	static float mux; // mean of the log spectrum
	static float smux; // mean of the framed signal
	TextView left;
	TextView right;
	TextView title;
	int tshift = 4; // frame shift in ms
	int tlen = 32; // frame length in ms
	static float[] audioBuf; // input audio samples
	static String inputPath;

Our onCreate method looks like this:


public class SpectrogramActivity extends Activity {
	/** Called when the activity is first created. */
	@Override
	public void onCreate(Bundle savedInstanceState) {
		super.onCreate(savedInstanceState);

		SetupUI();
		// Acquire the input audio file from the assets folder
		inputPath = "sp10.wav";
		try {
			audioBuf = WaveTools.wavread(inputPath, this);
		} catch (Exception e) {
			Log.d("SpecGram2", "Exception= " + e);
		}

		/* Calculate the log spectrogram data; done in an
		 * AsyncTask to avoid blocking the UI thread.
		 */
		new calcSpec().execute("test"); // dummy parameter
	}

Where SetupUI() is the function that sets up the layout of the activity programmatically, given by:

	private void SetupUI() {
		LayoutParams param1 = new LinearLayout.LayoutParams(
				LayoutParams.WRAP_CONTENT, LayoutParams.FILL_PARENT, 1.0f);
		LayoutParams param2 = new LinearLayout.LayoutParams(
				LayoutParams.WRAP_CONTENT, LayoutParams.FILL_PARENT, 1.0f);
		LayoutParams param3 = new LinearLayout.LayoutParams(
				LayoutParams.FILL_PARENT, LayoutParams.WRAP_CONTENT, 0.1f);
		LayoutParams param4 = new LinearLayout.LayoutParams(
				LayoutParams.FILL_PARENT, LayoutParams.WRAP_CONTENT, 1.0f);

		LinearLayout main = new LinearLayout(this);
		LinearLayout secondary = new LinearLayout(this);
		ScrollView scroll = new ScrollView(this);
		title = new TextView(this);
		left = new TextView(this);

		scroll.setLayoutParams(param4);
		main.setLayoutParams(param4);
		main.setOrientation(LinearLayout.VERTICAL);
		secondary.setLayoutParams(param1);
		secondary.setOrientation(LinearLayout.HORIZONTAL);

		title.setLayoutParams(param3);
		left.setLayoutParams(param2);

		secondary.addView(left);
		scroll.addView(secondary);

		main.addView(title);
		main.addView(scroll);

		setContentView(main);
		title.setText("FFT Spectrogram of speech example by DigiPhD");
		title.setTextSize(12);
		title.setTypeface(null, Typeface.BOLD);

		left.setText("Calculating.....\n");
	}

And our calcSpec AsyncTask:

private class calcSpec extends AsyncTask<String, Void, String> {
		int fs = 0; // Sampling frequency
		int nshift = 0;// Initialise frame shift
		int nlen = 0;// Initialise frame length 
		float nsegs = 0 ; //Initialise the total number of frames		
		@Override
		protected String doInBackground(String... params) {
			fs = WaveTools.getFs();
			nshift = tshift * fs / 1000; // frame shift in samples
			nlen = tlen * fs / 1000; // frame length in samples
			// Cast to double before Math.ceil, otherwise the integer
			// division floors the result first and the ceil does nothing
			nsegs = 1 + (float) Math.ceil((double) (audioBuf.length - nlen) / nshift);
			specGram(audioBuf,nsegs,nshift,nlen);
			
			return null;			
		}
		
		@Override
		protected void onPostExecute(String result) {
			left.setText("");
			left.setTextSize(4);
			for (int j = 0; j < nsegs ; j++){
				for (int i = 0; i < nlen; i++) {
					left.append(Integer.toString((int) spec[i][j])+" ");
				}
			}
			
		}
	}

Here you will notice it calls a function named specGram, which is given below:

public void specGram(float[] data, float nsegs, int nshift, int seglen) {

		spec = new float[seglen][(int) nsegs];
		array2 = new float[seglen];
		seg_len = seglen;
		n_segs = nsegs;
		n_shift = nshift;
		time_array = new float[data.length];
		time_array = data;

		framed = new float[seg_len][(int) n_segs];
		framed = FrameSig();
		minmax(framed, seg_len, (int) n_segs);
		meansig((int) n_segs);

		array = new float[seg_len * 2];

		res = new float[seg_len];
		fmag = new float[seg_len];
		flogmag = new float[seg_len];

		mod_spec = new float[seg_len];
		real_mod = new float[seg_len];
		imag_mod = new float[seg_len];
		real = new double[seg_len];
		imag = new double[seg_len];
		mag = new double[seg_len];
		phase = new double[seg_len];
		logmag = new double[seg_len];
		nmag = new double[seg_len];
		// Zero the analysis buffer once; only the first seg_len samples are
		// overwritten per frame, so the upper half acts as zero-padding
		// for a 2*seg_len-point FFT
		for (int i = 0; i < seg_len * 2; i++) {
			array[i] = 0;
		}

		// The sample rate is hardcoded to 8000 Hz to match the example
		// file; the FFT object can be reused for every frame
		FFT fft = new FFT(seg_len * 2, 8000);
		for (int j = 0; j < nsegs; j++) {
			for (int i = 0; i < seg_len; i++) {
				array[i] = framed[i][j];
			}
			fft.forward(array);
			fft_cpx = fft.getSpectrum();
			tmpi = fft.getImaginaryPart();
			tmpr = fft.getRealPart();
			for (int i = 0; i < seg_len; i++) {

				real[i] = (double) tmpr[i];
				imag[i] = (double) tmpi[i];

				// Magnitude spectrum, normalised by the frame length
				mag[i] = Math.sqrt((real[i] * real[i]) + (imag[i] * imag[i])) / seg_len;

				logmag[i] = 20 * Math.log10(mag[i]); // log magnitude in dB
				phase[i] = Math.atan2(imag[i], real[i]);

				/**** Reconstruction ****/
				// real_mod[i] = (float) (mag[i] * Math.cos(phase[i]));
				// imag_mod[i] = (float) (mag[i] * Math.sin(phase[i]));

				// Flip the frequency axis so low frequencies end up at
				// the bottom of the matrix for display
				spec[(seg_len - 1) - i][j] = (float) logmag[i];

				// Log.d("SpecGram","log= "+logmag[i]);
			}
		}
		minmaxspec(spec, seg_len, (int) nsegs);
		meanspec((int) nsegs);
		// fft.inverse(real_mod,imag_mod,res);

	}
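One practical caveat with the 20*log10() step above: a frame of pure silence gives mag[i] = 0, and Math.log10(0) returns negative infinity, which will wreck the min/max scaling later. If you hit this, a small floor inside the inner loop keeps the values finite; the 1e-10 threshold (about -200 dB) is my own illustrative choice, not part of the original code:

	// Guard against log10(0) = -Infinity before converting to dB;
	// the 1e-10 floor is an illustrative choice
	logmag[i] = 20 * Math.log10(Math.max(mag[i], 1e-10));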

You’ll notice an assortment of other functions used here:

- FrameSig() ---> frames the audio signal into overlapping frames
- minmaxspec() ---> finds the max and min of the spectrum
- minmax() ---> finds the max and min of the audio
- meansig() ---> mean/average of the audio
- meanspec() ---> mean/average of the spectrum
- hamming() ---> Hamming window function

These are listed below:

	/**
	 * Calculates the mean of the FFT magnitude spectrum
	 * @param nsegs
	 */
	private void meanspec(int nsegs) {
		float sum = 0;
		for (int j = 1; j < nsegs; j++) {
			for (int i = 0; i < seg_len; i++) {
				sum += spec[i][j];
			}
		}
		sum = sum / (nsegs * seg_len);
		mux = sum;
	}
	/**
	 * Calculates the min and max of the fft magnitude
	 * spectrum
	 * @param spec
	 * @param seglen
	 * @param nsegs
	 * @return
	 */
	public static float minmaxspec(float[][] spec, int seglen, int nsegs) {
		smin = (float) 1e35;
		smax = (float) -1e35;
		for (int j = 1; j < nsegs; j++) {
			for (int i = 0; i < seglen; i++) {
				if (smax < spec[i][j]) {
					smax = spec[i][j]; // new maximum
				}
				if (smin > spec[i][j]) {
					smin = spec[i][j]; // new minimum
				}
			}
		}
		return smax;
	}
	/**
	 * Calculates the min and max value of the framed signal
	 * @param spec
	 * @param seglen
	 * @param nsegs
	 * @return
	 */
	public static float minmax(float[][] spec, int seglen, int nsegs) {
		min = (float) 1e35;
		max = (float) -1e35;
		for (int j = 1; j < nsegs; j++) {
			for (int i = 0; i < seglen; i++) {
				if (max < spec[i][j]) {
					max = spec[i][j]; // new maximum
				}
				if (min > spec[i][j]) {
					min = spec[i][j]; // new minimum
				}
			}
		}
		return max;
	}
	
	/**
	 * Calculates the mean of the framed signal
	 * @param nsegs
	 */	
	private void meansig(int nsegs) {
		float sum = 0;
		for (int j = 1; j < nsegs; j++) {
			for (int i = 0; i < seg_len; i++) {
				sum += framed[i][j];
			}
		}
		sum = sum / (nsegs * seg_len);
		smux = sum;
	}
	
	/**
	 * Frames up input audio 
	 * @return
	 */
	
	public float[][] FrameSig() {
		float[][] temp = new float[seg_len][(int) n_segs];
		float[][] frame = new float[seg_len][(int) n_segs];
		// Length the signal would need to be for the last frame to be
		// full; indices past the end of the audio are zero-padded below
		float padlen = (n_segs - 1) * n_shift + seg_len;
		Log.d("DEBUG10", "padlen = " + padlen);
		Log.d("DEBUG10", "len = " + array2.length);

		wn = hamming(seg_len);
		for (int i = 0; i < n_segs; i++) {
			for (int j = 0; j < seg_len; j++) {
				int idx = i * n_shift + j;
				// Zero-pad the final frames that run past the signal's end
				temp[j][i] = (idx < time_array.length) ? time_array[idx] : 0;
			}
		}
		for (int i = 0; i < n_segs; i++) { // Windowing
			for (int j = 0; j < seg_len; j++) {
				frame[j][i] = temp[j][i] * wn[j];
			}
		}
		return frame;
	}
	/**
	 * Calculates a hamming window to reduce
	 * spectral leakage
	 * @param len
	 * @return
	 */
	public float[] hamming(int len){
		float [] win = new float [len];
		for (int i = 0; i<len; i++){
			win[i] = (float) (0.54-0.46*Math.cos((2*Math.PI*i)/(len-1)));
		}
		return win;
	}
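As a quick check that this matches equation [1]: the endpoints of the window come out at exactly 0.54 - 0.46 = 0.08, and the taper peaks near 1.0 at the centre:

	float[] w = hamming(256);
	System.out.println(w[0]);   // 0.08, since cos(0) = 1
	System.out.println(w[127]); // ~1.0 at the centre of the window
	System.out.println(w[255]); // 0.08 again, the taper is symmetric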

This stores a float[nlen][nsegs] matrix of log FFT data called spec, and then displays it to the screen, which will look something like this:

[Image: Log FFT data printed on the screen]


Of course it’s pointless printing the raw values to the screen, but I had to put something there :). You can now use this data to construct an image of the spectrogram like the one at the top of the post, or use it for some other purpose. I hope you have enjoyed this. You can see this method working in action in my Android app, Speech enhancement for Android.
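If you do want to draw the image, a minimal sketch along these lines maps each dB value in spec to a grey level using the smin/smax range found by minmaxspec(); the specToBitmap() helper is hypothetical, not part of the project source:

	import android.graphics.Bitmap;
	import android.graphics.Color;

	/** Illustrative only: map the log magnitude matrix to a greyscale Bitmap. */
	public static Bitmap specToBitmap(float[][] spec, float smin, float smax) {
		int h = spec.length;    // frequency bins (rows)
		int w = spec[0].length; // frames (columns)
		Bitmap bmp = Bitmap.createBitmap(w, h, Bitmap.Config.ARGB_8888);
		for (int x = 0; x < w; x++) {
			for (int y = 0; y < h; y++) {
				// Normalise the dB value into the 0..255 range
				int g = (int) (255 * (spec[y][x] - smin) / (smax - smin));
				g = Math.max(0, Math.min(255, g));
				bmp.setPixel(x, y, Color.rgb(g, g, g));
			}
		}
		return bmp;
	}

You could then hand the Bitmap to an ImageView; brighter pixels correspond to higher-energy time-frequency cells.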

Note the audio file is from [2].

References:

[1] J. Picone, “Signal modeling techniques in speech recognition,” Proc. IEEE, vol. 81, no. 9, pp. 1215–1247, Sep. 1993.
[2] Y. Hu and P. Loizou, “Subjective comparison of speech enhancement algorithms,” Speech Communication, vol. 49, pp. 588–601, 2007.