C++ Pipeline for Learning Fisher Vectors Using VLFeat

I implemented a C++ pipeline for learning Fisher feature vectors using VLFeat, since I prefer to avoid Matlab whenever possible. I also found the tutorials and API documentation on the web page a bit too rudimentary when it comes to implementing things in C++. I later found Python bindings, which might be a bit easier to handle, but they do not seem to have been updated in a while.

The code does the following. From a set of labeled images it extracts dense SIFT features. It then finds a PCA representation of the SIFT descriptors, reducing the dimension from 128 to 80 (or whatever dimension you want), and fits a GMM to the projected descriptors using the EM algorithm. From the learned GMM it computes the Fisher vector representation of each image. We can also append additional features read from a text file, such as the activations of a layer from a CNN like Caffe. After computing the feature vectors the code trains one linear SVM per class label. All learned representations are saved so that inference can be run on new images.
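
To give a flavor of the encoding step, here is a minimal sketch (not the actual repository code) of the GMM fitting and Fisher encoding using VLFeat's C API. It assumes the dense SIFT descriptors have already been extracted and PCA-projected; the function name and dimensions are only illustrative.

#include <vector>
#include <vl/gmm.h>
#include <vl/fisher.h>

// Sketch: fit a GMM to (PCA-projected) descriptors and Fisher-encode them.
// 'descriptors' holds numData descriptors of length 'dimension', stored contiguously.
std::vector<float> fisherEncode(const std::vector<float>& descriptors,
                                vl_size dimension, vl_size numData,
                                vl_size numClusters) {
  // Fit the GMM with the EM algorithm
  VlGMM* gmm = vl_gmm_new(VL_TYPE_FLOAT, dimension, numClusters);
  vl_gmm_cluster(gmm, descriptors.data(), numData);

  // The (improved) Fisher vector has 2 * dimension * numClusters entries
  std::vector<float> encoding(2 * dimension * numClusters);
  vl_fisher_encode(encoding.data(), VL_TYPE_FLOAT,
                   vl_gmm_get_means(gmm), dimension, numClusters,
                   vl_gmm_get_covariances(gmm), vl_gmm_get_priors(gmm),
                   descriptors.data(), numData,
                   VL_FISHER_FLAG_IMPROVED);

  vl_gmm_delete(gmm);
  return encoding;
}

In the actual pipeline the GMM is of course estimated once from descriptors pooled over all training images and then reused to encode each image separately.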

The code uses CMake for compilation and requires the Boost, OpenCV and Eigen libraries in addition to VLFeat. To train on some subset of images we can run it from the terminal

./fisher train

and to test new images

./fisher test

Just make sure you set the data paths correctly. You can find the code here.

Benchmarking Python fancy indexing vs. np.take

I recently discovered that fancy indexing is not that fast in Python. If you do a lot of it, it can eat up some valuable computational time. So I benchmarked the different approaches I know:

  • Just extracting the vectors using a True-False vector
  • Using the np.where command on the True-False vector
  • Using the np.take command on the result of np.where applied to the True-False vector
  • (I also tried np.take on just the True-False vector. It is really slow and is to be avoided)

From the plots below we can see that it is a close call between plain np.where indexing and np.take combined with np.where. Still, percentage-wise np.take has a slight advantage over just using np.where. As the number of dimensions gets higher the difference seems to shrink, which might be due to the internal workings of np.take. I have included the source code below as well.

[Plots: whereprofile_1, whereprofile_2, whereprofile_3]
import time
import numpy as np
import matplotlib.pyplot as plt
 
Nmax = 1E6
Niters = 100
pts = 20
dims = 100
runtime = np.zeros((4,pts))
 
for idx,Npts in enumerate(np.linspace(10,Nmax, num=pts,dtype=int)):
    for ii in range(Niters):
        X = np.random.rand(Npts,dims)
        y = np.random.randint(2, size=Npts)
 
        t0 = time.time()
        idxs = y == 1
        X1 = X[idxs,:]
        t1 = time.time()
        runtime[0,idx] += t1-t0
        del t0
        del t1
        del X1
        del idxs
 
        t0 = time.time()
        idxs = np.where(y == 1)
        X1 = X[idxs,:]
        t1 = time.time()
        runtime[1,idx] += t1-t0
        del t0
        del t1
        del X1
        del idxs
 
        # Sloooow
        # t0 = time.time()
        # idxs = y == 1
        # X1 = np.take(X,idxs,axis=0)
        # t1 = time.time()
        # runtime[2,idx] += t1-t0
        # del t0
        # del t1
        # del X1
        # del idxs
 
        t0 = time.time()
        idxs = np.where(y == 1)
        X1 = np.take(X,idxs,axis=0)
        t1 = time.time()
        runtime[3,idx] += t1-t0
        del t0
        del t1
        del X1
        del idxs
 
print(runtime)
s = np.linspace(10,Nmax, num=pts,dtype=int)
runtime = 100 * (runtime/runtime[0,:])
 
plt.plot(s,runtime[0,:],'ro-',label='True/False')
plt.plot(s,runtime[1,:],'bo-',label='Where')
# plt.plot(s,runtime[2,:],'gx-',label='Taken True/False')
plt.plot(s,runtime[3,:],'yx-',label='Taken Where')
plt.xlabel('Sample Size')
plt.ylabel('Average Time % of True/False')
plt.legend(loc=2)
plt.title('Sample dim 100')
plt.savefig('whereprofile.png', format='png', bbox_inches="tight")
plt.show()

The Rayleigh Quotient and the Norm Constraint

This post will try to explain why, in the optimization of the Rayleigh quotient, one constrains the norm of \(x\) to be \(\|x\| = 1\).

Let's start with the definition. Given a symmetric matrix \(A\), the Rayleigh quotient is defined as a function \(R(x)\) from \(\mathbb{R}^{d} \setminus \{0\}\) to \(\mathbb{R}\),

\begin{equation}
R(x) = \frac{x^T A x}{x^Tx}
\end{equation}

Maximizing the quotient can be restated as the following equivalent problem

\begin{equation}
\arg \max_{x} \frac{x^T A x}{x^Tx} = \arg \max_{x} \; x^T A x \quad \textrm{subject to} \quad \|x\|=1
\end{equation}

To untangle the norm constraint, we begin by remembering that a symmetric matrix is diagonalizable, so we can write,

\begin{equation}
A = U \Lambda U^{T} = \sum_{i=1}^{d} \lambda_i v_i v_i^T,
\end{equation}

where \(\lambda_i\) are the eigenvalues of \(A\) and \(v_i\) the orthonormal eigenvectors from the diagonalization (the columns of \(U\)). We can express \(x\) in this eigenvector basis, that is, \(x = \sum_{i=1}^{d} \alpha_i v_i\). This allows us to rewrite both parts of the quotient,

\begin{equation}
\begin{split}
x^Tx
& = \Big( \sum_{i=1}^{d} \alpha_i v_i \Big)^{T} \Big( \sum_{i=1}^{d} \alpha_i v_i \Big) \\
& = \{ \textrm{orthogonality + unit length} \} = \sum_{i=1}^{d} \alpha_i^2
\end{split}
\end{equation}

\begin{equation}
\begin{split}
x^T A x
& = \Big( \sum_{i=1}^{d} \alpha_i v_i \Big)^{T} \Big( \sum_{i=1}^{d} \alpha_i A v_i \Big)
= \Big( \sum_{i=1}^{d} \alpha_i v_i \Big)^{T} \Big( \sum_{i=1}^{d} \alpha_i \lambda_i v_i \Big) \\
& = \{ \textrm{orthogonality + unit length} \} = \sum_{i=1}^{d} \alpha_i^2 \lambda_i
\end{split}
\end{equation}

Combining these two expressions, the Rayleigh quotient can be written as

\begin{equation}
R(x) = \frac{x^T A x}{x^Tx} = \frac{ \sum_{i=1}^{d} \alpha_i^2 \lambda_i}{\sum_{i=1}^{d} \alpha_i^2}.
\end{equation}

We can see from this expression that the length of \(x\) does not matter: if some \(x\) maximizes the quotient then so does any multiple of it, \(k \cdot x, \; k \neq 0\), as the short calculation below shows. The maximization is over the direction of \(x\), not its length. This means that we can constrain \(x\) such that \(\sum_{i=1}^{d} \alpha_i^2 = \|x\|^2 = 1\), which reduces the problem to maximizing \(x^T A x\).
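
Spelling the scale invariance out explicitly,

\begin{equation}
R(k x) = \frac{(k x)^T A (k x)}{(k x)^T (k x)} = \frac{k^2 \, x^T A x}{k^2 \, x^T x} = R(x), \qquad k \neq 0.
\end{equation}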

The same thing can be shown for the generalized Rayleigh quotient (with \(B\) symmetric positive definite),

\begin{equation}
R(x) = \frac{x^T A x}{x^T B x}
\end{equation}

by proceeding in the same manner.

Deliver first, fix later

I came across an interesting talk where Adriel Wallick talks about creating 52 games in a year. The main takeaway for me was how the constraint of delivering a game per week forced her to come up with a new, doable idea every week. Shipping these ideas made her realize the big difference between the awesome ideas she had in her mind and their realization, and how they did not come through as intended. A side effect of delivering so fast was that she got the ideas she thought were awesome, but really weren't, out of her system, which forced her to be genuinely creative.

A similar approach should be applicable in science, where exploring small creative ideas at a fast pace removes perfectionism and enables creativity. Again, this is similar to the rapid exploration loop: (1) deliver, (2) upgrade, which forces one to ship first and make it better later.

Why LaTeX is broken and what needs to be fixed

Having programmed too much HTML and CSS gets your mindset warped; in the right direction. You start to believe religiously that code should be separated from layout, and that those two should be separated from the content. And then you are faced with LaTeX, which according to latex-project.org:

LaTeX is not a word processor! Instead, LaTeX encourages authors not to worry too much about the appearance of their documents but to concentrate on getting the right content.

Which is exactly the opposite of what LaTeX makes you do. Since you are not writing LaTeX all the time, you do not remember all the little tricks and extra packages that make LaTeX great, and you start to fuss over them and try to get the layout right instead of writing your paper. Obviously you want to do the layout when you do the layout, and write when you write, not the two simultaneously.

A second source of irritation is the non-separation of code and content. If I had a dollar... for every time I have pressed the enter key to get a newline and made the LaTeX compiler throw some weird error. At the end of the day you have done no writing but a whole lot of debugging.

Fixing LaTeX internally, or coming up with a totally new publishing format, would take too much time and no one would really care anyway. So what can you do?

Enter Markdown. Markdown, made by John Gruber, was first intended as a smooth way of converting simple text to HTML. But then it started to be used in other situations too. People realized they wanted to write, not lay out. Or as the main character in He Died With a Felafel in his Hand said,

Pages impose an artificial structure on my stream of consciousness.

So Markdown gets rid of the structure and lets you focus on the writing. It is simple and to the point. GitHub uses it. And it comes with the benefit of being platform and version independent.

Instead of writing \section{MyTitle} you write # MyTitle and \subsubsection{MyTitle} becomes ### MyTitle. New paragraphs are one press of the enter key away and images are easily inserted. But what about equations? The neat thing about Markdown is that it ignores everything that is not Markdown, which is almost everything. Putting an equation in your Markdown document will just come out as LaTeX code and not some jumble, which is fine.

In order to publish your document according to the scientific standards required by your conference or journal, you can just convert your Markdown document to LaTeX via the excellent terminal converter Pandoc by John MacFarlane. You will still need a base document for setting up the overall layout, but you can add the converted Markdown files via the LaTeX \input command.

I have put up a git repository of the system here, together with a Makefile to generate it all.

On the Trick for Computing the Squared Euclidean Distances Between Two Sets of Vectors

Many times one wants to compute the squared pairwise Euclidean distances between two sets of observations. As always, it is enlightening to look at the computation in the single case, between a vector \(x\) and a vector \(y\): \(\|x-y\|^2\). The computation of the distance can be rewritten into a simpler form,

\begin{equation}
\begin{split}
\|x-y\|^2
& = (x_1-y_1)^2 + (x_2-y_2)^2 + \ldots + (x_n-y_n)^2 \\
& = x_1^2+y_1^2-2x_1y_1 + \ldots + x_n^2+y_n^2-2x_ny_n \\
& = x \cdot x + y \cdot y - 2 x \cdot y.
\end{split}
\end{equation}

This means that the squared distance between the vectors can be written as the sum of the dot products of \(x\) and \(y\) with themselves, minus two times the dot product between \(x\) and \(y\).

How can we generalize this into an expression involving two sets of observations? If we let the observations in the first set be rows in a matrix \(X\) of size \(N \times M\), and the second set be rows in a matrix, \(Y\), of size \(K \times M\), then the distance matrix, \(D\), will be \(N \times K\).

The value of the entry in the \(i\)-th row and \(j\)-th column of \(D\) is the squared distance between the \(i\)-th row vector in \(X\) and the \(j\)-th row vector in \(Y\). That is, rows in \(D\) refer to observations in \(X\) and columns to observations in \(Y\).

This means that the term generalizing the dot product of \(x\) with itself should be an \(N \times K\) matrix whose \(i\)-th row consists of copies of the dot product of the \(i\)-th vector in \(X\) with itself. In Matlab, there are several ways of writing this,

repmat(diag(X*X'),1,K),

or better,

repmat(sum(X.^2,2),1,K).

However, since Matlab's repmat function is slow (or at least used to be), we can write the duplication as a matrix multiplication with a row vector of ones,

sum(X.^2,2) * ones(1,K).

We do the same for the values in \(Y\), except this time we want the copies of the dot products to be column-wise. This gives the following expression

ones(N,1) * sum(Y.^2,2)'

The final matrix is the one that generalizes the dot product between \(x\) and \(y\). This is simply given as, \(X*Y'\). Putting it all together we get,

D = sum(X.^2,2)*ones(1,K) + ones(N,1)*sum( Y.^2, 2 )' - 2.*X*Y'

I have uploaded the code to my Matlab Machine Learning pack on Git. Nowadays you are probably better off using Matlab's fast compiled version, pdist2, which is about 200% faster when the number of vectors is large.

Distances using Eigen

If we want to implement this in Eigen, a C++ library for doing linear algebra much in the same manner as in Matlab, we can do it in the following way,

// Construct two simple matrices with observations stored as rows:
// X holds N = 3 vectors and Y holds K = 4 vectors, both of dimension 4
Eigen::Matrix<double, 3, 4> X;
Eigen::Matrix<double, 4, 4> Y;
X << 1, 2, 3, 4,
     4, 5, 6, 4,
     7, 8, 9, 4;
Y << 3, 2, 4, 20,
     4, 5, 5, 4,
     1, 5, 10, 4,
     3, 11, 0, 6;

const int N = X.rows();
const int K = Y.rows();

// Allocate parts of the expression
Eigen::MatrixXd XX(N,1), YY(1,K), XY(N,K), D(N,K);

// Compute row-wise squared norms and the cross term
XX = X.rowwise().squaredNorm();
YY = Y.rowwise().squaredNorm().transpose();
XY = (2*X) * Y.transpose();

// Compute final expression
D = XX * Eigen::MatrixXd::Ones(1,K);
D = D + Eigen::MatrixXd::Ones(N,1) * YY;
D = D - XY;

// For loop comparison
Eigen::MatrixXd D2(N,K);
for( int i_row = 0; i_row != N; i_row++ )
  for( int j_col = 0; j_col != K; j_col++ )
    D2(i_row,j_col) = XX(i_row,0) + YY(0,j_col) - 2*X.row(i_row).dot(Y.row(j_col));

std::cout << D << std::endl;
std::cout << D2 << std::endl;

If we just want to compute the distance between one point and all other points, we can make use of some of Eigen's nice chaining functionality.

// Subtract row i of Y from every row in X (note that this modifies X in place)
X.rowwise() -= Y.row(i);
// Compute the row-wise squared norms
Eigen::VectorXd d = X.rowwise().squaredNorm();

Weird behaviour in Matlab parfor loops when using the random generator

It is always good to be careful when using random number generators. Things might not be as random as they seem. I came across an inconsistency when using parfor loops in Matlab. I wanted to generate random permutations of indices so that I could pick out a random training and test set from the data. I also wanted to use Matlab's parallelization tools, that is, a parfor loop. Here is what the code looked like:

% Loads variable X
load('mydata')
[N_samples,N_dims] = size(X);
N_trials = 1000;
N_training = 100;
N_test = 50;
parfor i_trial = 1:N_trials
    % Seed the random number generator based on the current time so that it outputs different sequences each time.
    rng('shuffle')
    % Returns a row vector containing a random permutation of the integers from 1 to N_samples.
    randIdx = randperm(N_samples);
    % Assign training and test indices
    trIdx = randIdx(1:N_training);
    teIdx = randIdx(N_training+1:N_training+N_test);
    ...
    % Do calculations
    ...
end

With this code, I was getting curious results. Somehow, the output seemed to be the same even though each trial was supposedly given random parts of the data for training and testing. When removing the parfor loop and just running a usual for-loop, the results were more reasonable.

I have a 4-core laptop, and it turns out that, looking at the randomly generated indices, they were the same in groups of four. I am not sure if my assessment is right, but somehow the rng seems to work globally, meaning that if you run a parfor loop it will only reshuffle once all the workers have finished the iterations they were assigned. If you have 4 workers there will be four runs, then a reshuffle, and so on. My solution was to generate the permutations beforehand in an ordinary for loop, storing them in a matrix, and then picking them out inside the parfor loop.

Lesson learned: when working with random number generators, make sure they are actually as random as you expect, and when combining them with parallel computing make sure that the parallelism does not affect the results of any computations.

I bought the Day One app and it made me very happy

I have been experimenting with ways of keeping a journal for my research since I started my PhD. I have tried everything from simple Markdown files to email reminders and a simple Tumblr log. The usual trade-off applies to journaling tools as well: either feature bloat or too simplistic to be expressive.

Integrating journal writing into the normal workflow is the hardest part. Text files tend to be forgotten since they are not in your face but hidden away in some folder, so a journaling habit takes discipline to keep up. Email reminders were useful, but the user interface of most email clients and my reluctance to spend time on email made it not work for long. The Tumblr log worked well for a while; the interface is nice and you can view previous posts easily. In the end I fell off Tumblr too, since I had to log in and start writing - it was just too easy to put off in favor of things that seem more important. In addition, it was not as expressive and easy to survey as I wanted. Therefore, I started looking around and found Day One.

[Screenshot of Day One]

Day One is fabulous compared to most of the options out there. It supports Markdown, tagging, location, and images. It has a slick but simple user interface that is easy to understand and get started with. It also has a calendar view, which makes it easy to get an overview if you do not write every day, and it has reminders if you are not writing. It synchronizes with Dropbox or iCloud. It is customizable, but not too much. The girls/guys who made it are serious geniuses. I wish it would cost more money so I could give them some.

The only thing I am missing is LaTeX support for some math journaling...

Chaining function calls in Matlab

Matlab has a nice feature that lets you chain commands by treating functions as variables. This means that you can write:

a = "foo";
b = feval(a,arg1,arg2,...)

Now if we want to chain commands acting on the same input we can write:

  myfoos = {'foo1','foo2','foo3',...};
  x = ...
  chainedFoosOnX = applyChain(x,myfoos);

  function x = applyChain(x,operators)
    for ii = 1:length(operators)
      x = feval(operators{ii},x);
    end
  end

Of course this example only works for single-argument functions, but it is simple to see how it would extend to more interesting scenarios.