Don't Get Burned By Heatmaps (gazehawk.com)
87 points by bkrausz on June 14, 2011 | 23 comments



"After reviewing the individual heatmaps, however, we get the impression that the site might have a bit too much going on."

Normal A/B split tests have a similar issue. Let's say you have a sales page and you're testing button colors: a red buy button, a blue one, a green one. And let's say blue wins overall with a high level of confidence.

Without drilling down (and most systems I've worked with are bad at this or can't do it at all) you can't tell whether certain types of users actually preferred other variants. For example, visitors from East Asia might convert WAY better with the red button (red being a lucky color in China). And visitors coming from certain sites might always convert better on the green.

Problem is, your split-test results only show the overall picture, so you optimize for blue... when your A/B tool could have analyzed the data and suggested segments for you. Does this tool exist (without making users guess segments up front)? If not, there's a ton of money being left on the table.
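
The core analysis isn't hard to do yourself once you log a segment alongside each conversion event. A rough sketch in Python (the column names and toy data are made up, not taken from any particular tool):

    # Sketch of per-segment A/B analysis with a chi-squared test.
    # Everything here is hypothetical placeholder data; real tests
    # need far larger samples than this toy example.
    import pandas as pd
    from scipy.stats import chi2_contingency

    events = pd.DataFrame({
        "segment":   ["east_asia"] * 4 + ["other"] * 4,
        "variant":   ["red", "red", "blue", "blue"] * 2,
        "converted": [1, 0, 0, 0, 0, 0, 1, 1],
    })

    for segment, group in events.groupby("segment"):
        # Contingency table: variant x (not converted, converted)
        table = pd.crosstab(group["variant"], group["converted"])
        chi2, p, dof, expected = chi2_contingency(table)
        rates = group.groupby("variant")["converted"].mean()
        print(segment, rates.to_dict(), f"p={p:.3f}")

The part a tool would really have to automate is searching over candidate segmentations, and correcting for the multiple comparisons that search introduces.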


Actually, this is a problem with most entry-level tools. Professional-level tools like Omniture Test and Target or Optimost with Autonomy Segments handle MVT testing per segment, with segments defined by external variables (appended data or data from Dataxu et al.), business variables (return customer, prospect, etc.), or plain online variables (IP-based geo, repeat vs. new visitor, etc.).

"Professional level", btw, means "costs a lot", "is painful to use", and "has a poor interface"; more recent tools give up some of the power (and the cost) for ease of use.

However, as you point out, these easier MVT (or split-test) tools focus on showing the "winner" or "winning variables", with "winner" defined as "most impactful on the total traffic, or a sample of it, during this time". You're right that this isn't the final answer... but it's a good starting place: if you solve for the majority of users, that's usually the biggest single step. Solving for smaller groups afterwards can often yield lower-volume but higher-value conversions... if you have time.

That being said, including segment variables in the analysis is a wonderful thing to do, and I encourage it. I look forward to the day when Visual Website Optimizer and other "entry level" tools include these in their analyses by default... and we can see the end of Optimost, Omniture TnT, and the other "enterprise" tools.


Seems to me like you are talking about Cohort Analysis: http://www.avc.com/a_vc/2009/10/the-cohort-analysis.html

But you are right, a tool for this would be nice.


Very cool stuff, Brian! It'd be interesting to try to cluster users based on what bits of the page they look at (women 65+ look at navigation breadcrumbs significantly more than average, say).


GazeHawk intern here -- I wrote the post. Our post next week is going to be focused more on using different clustering metrics as a way of making the data easier to understand.

We definitely have plans for running some big studies where we can do demographic comparisons, since that sort of information helps us bring a lot of value to our customers, but it might not be for a few weeks.


Seems like you need a heatmap for the variance as well as the mean.
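
Concretely, given one normalized heatmap array per user, that's just a per-pixel mean and variance. A minimal numpy sketch (random placeholder data stands in for the real per-user maps):

    # Per-pixel mean and variance across users' individual heatmaps.
    # `user_maps` is assumed to be one normalized 2-D array per user;
    # the random data here is a placeholder.
    import numpy as np

    user_maps = [np.random.rand(480, 640) for _ in range(30)]

    stack = np.stack(user_maps)    # shape: (n_users, height, width)
    mean_map = stack.mean(axis=0)  # the usual aggregate heatmap
    var_map = stack.var(axis=0)    # high wherever users disagree

    # A spot that is hot in mean_map but also hot in var_map is being
    # driven by a subset of users, not by everyone.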


That's an interesting idea... I'll see if anything comes of it.


This was my first thought also (although 7 hours later!). It may be interesting to look at skew as well as variance (kurtosis is probably going too far): the difference between a diffuse normal distribution and split peaks...

It may even be interesting to do some kind of k-means; it would be awesome to see that correlate with demographics...
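
Roughly like this (a sketch with scikit-learn; the per-user heatmaps and the choice of k are placeholders):

    # Cluster users by where they looked: flatten each user's heatmap
    # into a feature vector and run k-means. All data is placeholder.
    import numpy as np
    from sklearn.cluster import KMeans

    user_maps = [np.random.rand(48, 64) for _ in range(30)]

    features = np.stack([m.ravel() for m in user_maps])  # (n_users, h*w)
    labels = KMeans(n_clusters=3, n_init=10).fit_predict(features)

    for k in range(3):
        # Average the maps in each cluster to get one heatmap per
        # viewing "style", then check it against demographics.
        cluster_mean = features[labels == k].mean(axis=0).reshape(48, 64)
        print(f"cluster {k}: {(labels == k).sum()} users")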


To ensure a heatmap does not “eliminate the element of time”, the heatmap could show later-looked-at spots with more desaturated colors.

Of course, this has the same problem as the main point – that you can’t tell whether a spot is medium-saturated because everyone looks at it in the middle of browsing or because some people see it at the beginning and others see it only at the end. Still, this problem, like the article’s, could be fixed by moultano’s suggestion of making a heatmap for the variance as well as the value.

Or you could sacrifice the display of time, and display variance as saturation – less-variant, more-sure spots would have more saturated colors. This would be easier to read than two separate heatmaps for value and variance.
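
A rough sketch of that encoding (the mean and variance maps here are placeholders; in practice they'd come from the per-user heatmaps):

    # Encode gaze intensity as hue/value and certainty (inverse
    # variance) as saturation. Inputs are assumed to be in [0, 1].
    import numpy as np
    import matplotlib.pyplot as plt
    from matplotlib.colors import hsv_to_rgb

    mean_map = np.random.rand(480, 640)  # placeholder
    var_map = np.random.rand(480, 640)   # placeholder

    hsv = np.zeros(mean_map.shape + (3,))
    hsv[..., 0] = 0.7 * (1.0 - mean_map)  # hue: blue (cold) to red (hot)
    hsv[..., 1] = 1.0 - var_map           # washed out where users disagree
    hsv[..., 2] = 0.3 + 0.7 * mean_map    # brighter where gaze is dense

    plt.imshow(hsv_to_rgb(hsv))
    plt.axis("off")
    plt.show()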


During my final year project for my BSc, I tried to address the problem of losing the temporal data from eye tracking in heatmaps with an accelerated-replay heatmap animation. Here's an example:

http://www.youtube.com/watch?v=L319pLmzHVc&feature=chann...

This video shows a selection of sittings from radiologists reporting on chest x-rays.

If I recall correctly, it is sped up to approximately 5× realtime.
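
The core mechanism is simple: deposit each gaze sample into an accumulator and decay it every frame so old fixations fade out. Roughly (a reconstruction of the idea, not the original code; the gaze-sample format is assumed):

    # Animated heatmap with exponential decay, so the display
    # "forgets" old fixations. Samples are assumed to be (t, x, y)
    # tuples in seconds and pixels; the data here is a placeholder.
    import numpy as np
    from PIL import Image

    HEIGHT, WIDTH, FPS, DECAY = 480, 640, 30, 0.9
    samples = [(i / 60.0, np.random.randint(WIDTH), np.random.randint(HEIGHT))
               for i in range(600)]

    heat = np.zeros((HEIGHT, WIDTH))
    frame_idx, next_frame_t = 0, 0.0
    for t, x, y in samples:
        while t >= next_frame_t:  # emit frames up to the sample time
            gray = np.clip(heat * 255, 0, 255).astype(np.uint8)
            Image.fromarray(gray).save(f"frame_{frame_idx:04d}.png")
            heat *= DECAY         # old gaze fades out
            frame_idx += 1
            next_frame_t += 1.0 / FPS  # step by 5.0 / FPS for a 5x replay
        heat[y, x] += 1.0             # deposit the new fixation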


Nice visualization! We have something similar that we generate:

https://s3.amazonaws.com/gazehawk-public/example_track.mp4

Definitely a step up from heatmaps, though I think there's a lot to be said for providing data as a static image rather than a video, especially when results are being added to a PowerPoint or PDF.


Though I'm pretty optimistic about the possibility of algorithmic user interface design (or at least quantitative evaluation), I've yet to see any conclusive studies on the matter. Anybody got any hard info on this?


I think it's really difficult to rigorously evaluate that sort of thing. Different websites serve different purposes and need different designs.


Yes, I suppose different sites will need slightly different fitness functions, but I'm confident in the future of automated design.


Take GazeHawk as an example: they claim to work using your system's built-in webcam. Being interested in eye tracking for non-advertising purposes, I did a few calculations, and I fail to see how this is possible at typical monitor distances and typical webcam resolutions.

Does anyone have any insight as to how this is done? Currently, I'm highly skeptical that these heatmaps are the least bit accurate.


What calculations did you do?

As cofounder of GazeHawk, I've written on different aspects of this topic previously [1, 2]. Is that information helpful / can you elaborate on your skepticism?

[1] http://www.gazehawk.com/blog/on-accuracy/

[2] http://www.quora.com/GazeHawk/What-broad-computer-vision-tec...


Thanks for the links, they were very interesting.

I did basic trigonometry, and came up with an estimate of about 50 pixels of accuracy using a high-res, third-party webcam. That's why I doubt claims about built-in webcams, since they're typically pretty low resolution.

Your first link mentions an accuracy of around 70 pixels on a MacBook Pro, which is impressive but doesn't strike me as impossible (assuming a FaceTime HD camera, which is 1280x720, I believe).
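
For the curious, here's roughly the geometry I mean (every physical parameter below is an assumption, not a measurement):

    # Back-of-the-envelope gaze resolution from webcam resolution.
    # Every number below is an assumed typical value.
    import math

    eyeball_radius_cm = 1.2   # typical human eyeball
    view_dist_cm = 60.0       # eye-to-screen distance
    screen_width_cm = 33.0    # roughly a 15" laptop display
    screen_width_px = 1280
    cam_h_fov_deg = 60.0      # webcam horizontal field of view
    cam_width_px = 1280       # a "high-res" webcam

    # Full gaze sweep across the screen, in degrees of eye rotation:
    gaze_range = 2 * math.degrees(math.atan(screen_width_cm / 2 / view_dist_cm))

    # That rotation shifts the pupil center by roughly:
    pupil_shift_cm = 2 * eyeball_radius_cm * math.sin(math.radians(gaze_range / 2))

    # Size of one webcam pixel at the viewing distance:
    cam_fov_cm = 2 * view_dist_cm * math.tan(math.radians(cam_h_fov_deg / 2))
    pupil_shift_px = pupil_shift_cm / (cam_fov_cm / cam_width_px)

    print(screen_width_px / pupil_shift_px, "screen px per webcam px of pupil motion")

With these numbers, one webcam pixel of pupil movement covers on the order of a hundred screen pixels, so sub-pixel pupil localization is essentially mandatory to get anywhere near 50-70px.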


While the resolution of the webcam is (obviously) important in determining accuracy, I feel like you may be conflating two terms here. Specifically, going from the resolution of the webcam to an estimate of so many pixels of accuracy using basic trigonometry will necessarily depend on the method you're using to convert the webcam input into eye-tracking data.

The <70 pixel figure for GazeHawk's accuracy is based on testing against real, labeled training data. That is the distance, on the screen, by which our calculated gazepoint differs from the true location at which the user's gaze was directed. It is only loosely correlated with webcam resolution, in that a higher webcam resolution corresponds to a larger pipeline: more input pixels being dumped into the eye-tracking algorithm. I could be wrong, but it sounds like you're discussing the size of the eyes in the input image.

Also, at this point a discussion of accuracy vs. precision becomes germane. The use of higher resolution video as an input can often impact one but not the other.


"I feel like you may be conflating two terms here."

Probably. :-)

"I could be wrong, but it sounds like you're discussing the size of the eyes in the input image."

I believe I am. I assume that increased pixel count in the eye region corresponds directly with increased accuracy. This could be accomplished by either moving the camera closer to the eye, or by using a higher resolution camera.


Modern eyetracking solutions can be quite accurate. I've used SR Research's Eyelink II [1] and have seen results from Tobii [2] eye trackers that are impressive too, considering their passive nature.

[1] http://www.sr-research.com/EL_II.html

[2] http://www.tobii.com/en/analysis-and-research/global/product...


The SR Research I can totally understand, since it's head-mounted, proprietary hardware. The Tobii is a bit more impressive, but I think it's using IR emitters to light up the pupil, and I assume it's using their own camera. Very impressive monitoring rates, though.

My doubt concerns claims to track eye movements reliably using a built-in webcam, and available light.


Pardon my ignorance, but I find your efforts quite insignificant. If you were to ask me where most people looked on that example, I could accurately tell you what got the most looks first, second, third, etc., without ever seeing your heatmap. Do I really need a third party to tell me people like boobs more than Ron Paul?

I'd like to add that the length of eye contact is irrelevant next to a more valuable metric: interpretation. Let's say we could determine (or even narrow down) users' interpretation of content and heatmap its relevance to their visit. If we could take a proactive stance, we could predict future visits and adjust the content accordingly, rather than being reactionary and simply saying (after the user is gone) that people looked at boobs more than Ron Paul.

Just my two cents, sorry if I was a dick.


Did you even read the article?

The aggregate confirms what you're saying: that people will look down the middle at all the half-naked people.

But look at the individual heatmaps: it seems like a not-insignificant number of people followed the text more than the images, some drifted all over the place, etc.

That's the entire point of the post, I think: aggregates can be deceiving when the distribution is not even close to uniform.



