Chaining function calls in Matlab

Matlab has a nice feature that lets you chain commands by treating functions as variables. This means that you can write:

a = "foo";
b = feval(a,arg1,arg2,...)

Now if we want to chain commands acting on the same input we can write:

  myfoos = cell{'foo1','foo2','foo3',...};
  x = ...
  chainedFoosOnX = applyChain(x,operators);
  function x = applyChain(x,operators)
    for ii = 1:length(operators) 
      x = feval(operators{ii},x);

Of course this example only works for one argument functions but it is simple to see how it would extend to more interesting scenarios.

Saving from inside a parfor loop in Matlab

If you call save from a parfor loop in Matlab, like this for example:

parfor k=1:100
  foo = exp(-k);

it will generate an error. I have not exactly figured out why this is but probably it has to do with the fluctuating nature of the foo variable inside the parfor loop. By calling a function with a filename and the variables you want to save as arguments you can get around this problem. By doing this I am guessing Matlab sees the variables as outside of the parfor block and no longer uncertain in value. The code would look something like this.

parfor k=1:100
  foo = exp(-k);
function mySave(fName, foo)

Making function returns in Matlab compact

I used to write this in Matlab:

  foo1 = zeros(1,100);
  foo2 = cell(1,100);
  for i = 1:100
    [tmpFoo1, tmpFoo2] = foo ( X );
    foo1(i) = tmpFoo1;
    foo2{i} = tmpFoo2;

This was until I realized you can write:

  foo1 = zeros(1,100);
  foo2 = cell(1,100);
  for i = 1:100
    [foo1(i), foo2{i}] = foo ( X );

Which makes sense and doesn't make sense.

Understanding the Metropolis Hasting algorithm - A tutorial

The Problem

Say you want to evaluate the expectation of a function over a random variable

\(E[f(x)] = \int p(x)f(x)dx\),

or perhaps find the max of a probability distribution, typically the posterior,

\(\arg \max p(x|y)\),

where calculating the derivate is intractable since it depends on some feature that comes from some algorithm or some unknown function.

These situations typically make us resort to approximate methods. One way of solving the expectation is to draw \(N\) random samples from \(p(x)\) and when \(N\) is sufficiently large we can approximate the expectation or the max by

\(E[f(x)] \approx \frac{1}{N} \sum\limits_{i=1}^{N}f(x_i)\),

We can apply the same strategy to finding \(\arg \max p(x|y)\) by sampling from \(p(x|y)\) and taking the max value in the set of samples. When we have acquired a large enough amount of samples, we can be fairly sure of having found the max.

This way of approximating invariably leads to the question how to sample from a distribution \(p(x)\)(or \(p(x|y)\)). However, how can we sample from a pdf that in its current form already is too complicated?

A common way of sampling from a troublesome distribution is to use some kind of Markov Chain Monte-Carlo(MCMC) method. The core idea of MCMC is to generate a Markov chain that converges to the desired distribution after a sufficient amount of samples.

Markov Chains & Detailed Balance

A Markov chain is a sequence of random variables such that the current state only depends on the previous state,

\(p(X_n|X_{n-1},X_{n-2},X_{n-3}...,X_{1}) = p(X_n|X_{n-1}) \).

To simulate a Markov chain we must formulate a transition kernel, \(T(x_i,x_j)\). The transition kernel is the probability of moving from a state \(x_i\) to a state \(x_j\); it can be either discrete or continuous. In undergraduate courses one often finds that it is discrete and formulated as a transition

Convergence for a Markov Chain means that it has a stationary distribution, \(\pi\). A stationary distribution implies that if we run the Markov Chain repeatedly, for long enough time, the samples we get for each run will always form the same distribution.

So how do we know when the kernel has a stationary distribution? Detailed balance is a sufficient but not necessary condition for a Markov chain to be stationary. Detailed balance essentially says that moving from state \(x_i\) to \(x_j\) has the same probability as moving from \(x_j\) to \(x_i\). Observe though that this does not mean that all states are equally probable. We state detailed balance more formally as follows,

\(\pi(x_i) T(x_i,x_j) = \pi(x_j)T(x_j,x_i), \; \forall x_i,x_j\).

The MCMC Idea

The key idea for sampling from a distribution, \(p(x)\), using MCMC should be clear from the above. We put \(p(x)\) as our stationary distribution, find a transition kernel, \(T\), that fulfills the detailed balance condition, and generate samples from the Markov chain. By detailed balance the Markov chain will converge to our desired distribution, and we can use the samples for calculating the expectation or finding the max, or for whatever other reason we wanted samples from the distribution.

Finding Detailed Balance

How do we find a transition kernel that fulfills the detailed balance condition? Say we pick any distribution (transition kernel), \(T(x_i|x_j)\), that we can easily sample from. It will naturally be so that we will move from some state \(x_i\) to state \(x_j\) a bit more probable, that is,

\(\pi(x_i) T(x_i,x_j) > \pi(x_j)T(x_j,x_i)\).

We can compensate for this by multiplying in another transition term, \(A(x_i,x_j)\) such that,

\(\pi(x_i) T(x_i,x_j) A(x_i,x_j) = \pi(x_j)T(x_j,x_i)\).

This means that our balancing term will have the form,

\(A(x_i,x_j) = \frac{\pi(x_j)T(x_j,x_i)}{\pi(x_i) T(x_i,x_j)}\).

If we are in state \(x_i\), and have generated a sample from \(T\), \(x_j\). \(A(x_i,x_j)\) thus becomes the binary probability of either transitioning to state \(x_j\) or staying in state \(x_i\). All we have to do to make a decision is to generate a uniform sample, \(u\), on \([0,1]\) and if \( u < A(x_i,x_j)\) we move to state \(x_j\). Sometimes we will have \(A(x_i,x_j)>1\), this means that moving from \(x_i\) to \(x_j\) is likely but the reverse is very unlikely, implying that the sample will need to automatically pass to the new state every time.

The Full Metropolis Hasting Algorithm

The full MH-algorithm thus takes the following form:

(0) For \(n = 1,2,\ldots, N\), set current state to \(x_i\) and do

(1) Sample from the proposal distribution \(q(x_j|x_i)\).

(2) Evaluate \(\alpha = \frac{q(x_i|x_j)p(x_j)}{q(x_j|x_i)p(x_i)}\)

(3) Accept the proposal sample, \(x_j\) with probability \(\min[1,\alpha]\) otherwise keep the current sample \(x_i\) as the new sample.

A short tutorial on Kernel Density Estimation (KDE)

The aim of Kernel Density Estimation(KDE) is:

Given a set of \(N\) samples from a random variable, \(\mathbf{X}\), possibly multivariate and continuous, estimate the random variables probability density function(pdf)

The univariate case

To get a rough idea how one can think about the problem, we start out with a set of samples, \(X=[x_1,x_2,...,x_N]\), of a continuous, one-dimensional, random variable(univariate) on \(\mathbb{R}\). To get a estimate, we assume that the pdf is constant on small intervals \(dx\), this means our pdf is now piecewise constant and discretized. The amount of the total probability mass falling into every small interval can be written as,

\( P_{dx} \approx f(x)dx = \frac{N_{dx}}{N}\),

where \(N_{dx}\) is the number of samples falling into a particular interval. What we have done, in other words, is to estimate the pdf by the normalized histogram of the samples from the random variable. In the animated gif below you can see such an estimate of a normal distributed random variable.


This estimate of \(f(x)\) is quite blunt because of the discretization, and thus dependent on the size of \(dx\) and the number of samples we have. We can see this in the gif above where we start out with a few set of samples and large bins, \(dx\). As we go along, we get more samples, and can make dx finer, which results in a approximation that is quite near the original normal distributions.

The multivariate case

In the multivariate case we simply replace \(dx\) by a volume \(dV\) and the estimate above becomes

\( P_{dV} \approx f(x)dV = \frac{N_{dV}}{N}. \)

We can write out the equation in units / words to get a better understanding of it:

\( \mathrm{probability\; mass} \approx (\mathrm{probability\; mass}/\mathrm{volume})* volume = \\ (\mathrm{probability\; mass}/ \mathrm{sample}) * \mathrm{samples} \)

As we move from univariate to multivariate most of the volumes, \(dV\), will contain few or zero samples, especially if \(dV\) is small. This will make our discretized estimate quite jumpy.

The KDE approach

KDE approaches the discontinuity problem by introducing uncertainty (or noise) over the sample location. If there is uncertainty in the position of the sample the probability masses assigned to sample points will spread out, influencing other \(dV\)'s, giving a better and smoother estimate. Further on, since the samples now are spread out the grid or volumes estimate transforms into a sliding window estimate.

To show how this can be done we start out by specifying a kernel, \(K(u)\), at each sample point. The kernel is equivalent to spreading out the count of a sample to the whole space. For the input to the kernel, centered at sample \(x_i\), we write, \( \frac{x-x_i}{dx} \).

The input to the kernel reads as the ratio between the to-sample distance and discrectization size. We are thus feeding the kernel a value that is linear function of the to-sample distance and the grid size.

The kernel is formulated such that the following holds,

\( \int K(u)\,du = 1 ,\; K(u) \geq 0 \).

This looks suspiciously as a pdf, and that is essentially what it is.

Choosing the right kernel is more of a data problem than theory problem, but starting with a Gaussian kernel is always a safe bet.

\(K(\frac{x-x_i}{dx}) = \frac{1}{(2\pi dx^2)^{D/2}} e^{-\frac{1}{2}(\frac{||x-x_i||}{dx})^2}\)

\(dx\) here becomes the variance of the kernel and restricts the sample count spread to be concentrated within the volume. Essentially, \(dx\) controls the smoothness of the estimate, and its thus only natural that it should affect the concentration of probability mass. We can now pin down the pdf as a sum,

\( f(x) = \frac{1}{N} \sum\limits_{i=1}^{N} K (\frac{x-x_i}{dx}) \),

where \(K\) is any kernel that fulfills the requirements stated above. Below is an animated gif of the behaviour of the KDE as function of the number of samples one with a fine \(dx\)


The blue dotted lines are individual kernel values (not multiplied by N), red the estimated density and magenta the actual density. The first image has a coarse grid value requiring only a few samples while the lower image has a finer grid requiring many more samples. This is one of the weaknesses of the approach since \(dx\) is the same over the whole space and cannot be made finer where there is lot of variation and coarser when there is not. If we choose a bad average value \(dx\) we might end up with the situation that arises in image two. Luckily there are ways to fix this but that is for another post.

A footnote, one can interpret the sum over the kernels as a mixture model where the mixture components are equal, \(1/N\), and with the same local support. If one is familiar with the EM or variational approach to Gaussian mixture models (GMM) one can easily see the shortcomings of this static approach to density estimation as well as the simplicity and ease of implementation as compared to the GMM approach for density modeling.

For a longer more thorough explanation see density estimation by Silverman. The explanation in Bishop is also quite good.

The psychology of debugging - the bias of complexity

Most of the times when you are writing code you are make mistakes either in the structure or in the specific code. These mistakes become bugs that sometimes eat up a lot of time to diagnose. Once diagnosed, bugs are usually easy to fix unless there is a big structural mistake, which makes you, rewrite the whole code. So roughly you could divide the problem into two categories: bugs that happens because you are not understanding the problem completely or you don't have good understanding of the language you are programming in, and bugs that happens because you are tired or not concentrated enough etc. Structural problems can be fixed by planning better and understanding the programming language better i.e. the hard work involved with becoming a great programmer.

Here I am more interested in are the sloppy mistakes since the payoff is bigger in fixing them but they take equally long time to diagnose as the structural imho. Usually the IDE will help you diagnose and fix a lot of those errors. But there are some that are harder to find - the sloppy mistakes that affects the structure. In a recent assignment I had we were to implement a prediction algorithm and calculate the mean error. The algorithm was complicated both mathematically and programming wise. Calculating the error on the other hand was simple - just compare the predicted with the true values. Where did I make the mistake? In the error calculation of course. I was calculating the mean of the correct answers not the wrong ones. I was getting terrible results I rewrote the algorithm three times and spent time debugging but nothing worked until I started looking at the error function.

Now why did I do that mistake? And why didn't I go there and check it the first thing I did? I would like to call the problem the bias of complexity. Since the problem was complex I believed that the bug must lie in the code of the algorithm or in the way I had structured it not in the simple code calculating the error. Therefore I did not debug my error function.

So since the algorithm was complex I was biased to believe that the error must lie there but it was really all just a sloppy mistake. So the assumption that complex code is more error prone is not as correct as one believes. When the code is complex you concentrate more and check your code more, and when the code is simple you relax and just write.

But how do we fix this? First of all write test functions even for simple code. Second, always check the simple code first and verify that it is correct indices, error functions etc. Third, reevaluate your assumptions.

The start-up evangelic

Attended the Stockholm Start-up Day 2013 event. It was really fun and the event was well organized with a good flow. It is great what people are trying to do for Swedish entrepreneurship. Not that there is a lack of really good entrepreneurs in Sweden, but the general mindset of most people is still to get a good job and not venture where people has never ventured before. I do like the culture they are trying to foster. And the event is great for meeting and talking to likeminded people of course.

For a person like me who has been doing some start-ups already and has experience dating ten years back there was not much new information to be gained from the talkers - except learning from their presentation skills. A lot of talks were rehashes of truisms like you need to have a great team; you need to foster a great culture etc. I wrote a mean haiku in the midst of those talks

Start-up evangelics,
Clapping their hands to nothing,
Just do it already.

I wish that the speakers had ditched the old advice giving presentations and actually presented a more detailed view of a certain aspects of the process of their startup like exactly what did they do to get a great team or foster a great culture. Those in depth reflections are immensely more interesting than - things I have learned in my life so far concept copying. I would have liked more talks from people that were in the process of starting up than people already finished - lessons learned tend to fade quickly.

Not to say there was interesting talks at the event - the four that stuck with me where: Chad Hamre of Ethical Ocean who gave a no slides straight up talk of him starting the company and had everyone’s attention at the end of the talk, Magnus Engervall of FlexiDrive who daringly gave lesson about economics rather than talk about his start-up - kudos for that - and Ida Östensson from Crossing Borders and Josefina Oddsberg from Bee Urban that gave great honest talks about their businesses without the superficial glossiness that can get to be a bit too much some time.

The best event of the day was the Start-up hot seat where two investors talked to three start-ups in an interrogative style. It was interesting to see how the different persons reacted and viewed their companies and how that differed from the investors view. All start-ups described themselves as the xx for xx. The investors however went straight to two things: metrics and selling points. I got more out of this 24 min session than anything other during the day. I wished there was more of this.

Partially fixing numerical underflow for mixture models

In a mixture model the probability of an event \(x\) is written

\(P(x=X) =\sum_{i}\pi_{i}P_{i}(x=X) \),

where \(\pi_{i}\) is the probability of the point belonging to mixture \(i\) and \(P_{i}(x=X) \) is the probability of the event \(x=X\) in the \(i\)-th mixture. The problem is usually that the \(P_i\) are small which makes underflow happen when you multiply them with big numbers such as \(\pi_i\) or summing up the values. Underflow is simply your computer not having enough precision to handle all those tiny numbers and so rounding errors happen. For example \(1+\epsilon = 1\) if \(\epsilon\) is really small.

To fix underflow one usually operates in the log domain i.e.

\(\log P(x=X) = \log [ \sum_{i}\pi_{i}P_{i}(x=X) ]\).

The problem with this is that the log cannot decompose sums and we still get underflow. To fix this(somewhat) we can write:

\(\log P(x) = \log [ \sum_{i}\pi_{i}P_{i}(x=X) ] = \log [ \sum_{i} \exp\big\{ \log\pi_{i} + \log P_{i}(x=X) \big\} ]\)

Now by finding the max value of the different sums \(\log(\pi_{i}) + \log ( P_{i}(x=X) )\) and deducting it we can move out most of the probability mass in the equation. It is simple if one looks at the following calculations for a mixture model with \(2\) mixture components.

\(\log p = \log [p_1 + p_2] = \log [ \exp(\log p_1) + \exp(\log p_2) ]\)
\(pMax = \max [\log p_1 ,\log p_2 ]\)
\(\log p = \log[ \exp(pMax) * ( \exp\big\{\log p_1 -pMax\big\} +\exp\big\{log p_2-pMax\big\}) ]\)
\(\log p = pMax + \log [ \exp(\log p_1 - pMax) + \exp( \log p_2 - pMax) ]\)

Now if we for example assume \(\log p_1 > \log p_2\) then
\(\log p = pMax + \log [ \exp(0) + \exp\big\{\log p_2 -pMax\big\} ] = \\ pMax + \log [ 1 + \exp\big\{\log p_2 - pMax\big\} ]\)

This means that we gotten out most of the probability mass from the sum and we have made the exponent in the exponential closer to zero and thus less small. This of course doesn't mean numerical issues will be a concern anymore but I believe we are more in the clear than before.

Below is an implementation in Matlab. Instructions in the file.


Google can keep Keep....

I used to be a Google fanboy. However the disbanding of the Google Reader together with a bunch of thoughts on it here [1][2][3] just to mention a few, left me thinking about how I treat my personal data. It also made me think about how much I rely on free services in general which might get shut down. gmail might not, and drive might not but I am guessing one should be more careful in the future before integrating something into one's workflow such that it becomes a vital part of one's life. As it is now I will not be even looking at Google Keep. It is also scary putting all eggs in one basket. Diversity usually doesn't thrive when there is only one player on the market. (And I kinda love Evernote.)

Maybe it is even time to move away from gmail and drive and set up ones own server? The software is there and the process is not nearly as complicated as before. Further on the cost is tiny nowadays and all your data will belongs to you and not us.

How to manage a (paper) deadline - 9 rules (to rule them all)

The last couple of weeks have been fraught with a paper deadline. My second one. Times have been stressful. The paper involved starting up from scratch, learning a new programming language, and how to work a robotic arm all in less than one and a half month. One of the realizations I had was that setting up some rules of conduct was necessary to manage the amount of stress and imperative for success. So I put up these nine rules:

1. Get good sleep
If you are tired you cannot think. If you cannot think you will not progress unless you are doing some repetitive task; so you might as well go home and sleep.

2. Get good sleep
There I said it again.

When you are tired everything move slower. Writing goes slower. Going to the gym becomes a drag. Meetings become a pain. You become grumpy. Your partner will get irritated by the fact that on top of you being in the lab all the time you are also being grumpy; that is if your not dating someone in the lab.

3. Get good sleep
Seriously, if you are tired and all your experiments fail everything will suck. If you are not tired the world will suck a bit less. Paper deadlines are more dependent on progress to sustain a feeling of momentum to make you put up with the stress. If things are not working out, as they should pressure will pile up. Performing research under stress and high pressure is usually not productive.

4. Stop drinking coffee
We all enjoy a good cup of java but hey during a deadline everyone has a tendency to become coffee junkies and coffee monsters. Too much coffee stresses you out if you’re under stress. Switch to tea. Days of stress I just drink green tea.

5. Reduce intensity of exercise but increase the frequency
Most of your energy goes into thinking and working during a paper deadline. Therefore it is important to take regular breaks. The best breaks involve some kind of physical activity. I intentionally reduce the intensity of my exercise schedule during deadlines and instead increase the frequency; going the gym for about an hour every day. It helps me refocus and relax without getting tired. And I came out of this deadline more fit than I was before. Further on while my colleagues started looking stressed - burned out sipping giant jugs of coffee - I still was fresh and relaxed.

5. Take social breaks
Just sitting in the lab day after day will get you momentum but it will also leave you increasingly blind to the obvious. Your mind needs time to recharge. Social breaks will let you relax and take your mind of work. Getting totally wasted on the week of the deadline might not be recommendable but going out for a nice dinner or a glass of wine certainly helps. Despite the hours you loose you will gain it in the pure momentum you get the next day when you go the lab remembering there is actually a life out there.

6. Give yourself a present every day
I think this quote from Twin Peaks sums it up well - "Harry, I’m going to let you in on a little secret. Every day, once a day, give yourself a present. Don’t plan it, don’t wait for it, just let it happen. It could be a new shirt at the men's store, a catnap in your office chair or two cups of good hot black coffee.” Basically to lessen the stress let yourself enjoy some good moments during the day. Buy a carrot cake muffin for breakfast, enjoy a long lunch, or take a sauna after the gym.

7. If work is going terrible take a day off
This might sound counter productive but when in deadline mode momentum is the most important thing you have. If you stop moving forward and feel frustrated, stressed, and in general lousy, getting out and doing something completely else might be the solution. My recent deadline was extended for a week. The moment I heard of the extension I closed my laptop and went to a nice dinner with my girlfriend and slept in the next day. Coming back to work it was like the deadline just had started. A fresh reboot.

8. Switch working environment
We all know that switching environment helps when working on a problem. The solution comes to us as we are biking home, jogging in the forest, trying to go to sleep i.e. doing something completely different from working somewhere completely different from your cubicle. During the last deadline I took two papers I wanted to read for the paper I was working on and went to a nice cafe, ordered an Americano and a carrot cake, and got to work. I came up with so many ideas in that hour that it generated two pages of text in the final submission. The ratio of work to output was thus extremely big. I realized I should do this more often.

9. Eat good
Eating frozen supermarket food for lunch, grabbing a sandwich with some crisps in the cafeteria, eating candy instead of lunch or skipping it all together to work is counter productive. Get yourself good cooked food with lots of fibers and greens. This of course should be an everyday virtue but hey we are just humans. However keeping this habit during deadline work will make you feel less like a dreaded monster loaded with sugar, fat, and caffeine.

Related pop psychology validating the rules...