AB testing – a DIY approach. Part 1 – Test Allocation, who gets what?
AB testing is a complex and difficult beast. There are a few commercial offerings, such as Optimost, and Google has its own tool, Website Optimizer, which is integrated into the AdWords site. Google have been pushing the agenda on this because better conversion for their customers means more PPC spending, so in the end Google themselves stand to reap the rewards. I am not going to discuss these packages, but I am going to give an overview of how to build your own AB testing framework, as I did exactly this for a large but cannot-be-mentioned dotcom.
This is a really big subject area, and this is just an introduction, so let's break it down into the following areas:
1. Collecting Data
2. Presenting Tests
3. Analysing Results
Collecting data, at its simplest level, is just registering which customers did what. Generally the most interesting KPI is conversion, because of its low variance compared with other KPIs such as average margin per unique visitor. Conversion is simply the number of conversions made divided by the number of unique visitors to the site. For AB testing, when there are only A and B groups, you calculate this separately for each group.
Example:
‘A’ group had 445 unique visitors, and 34 conversions. It has a conversion rate of 34/445 which is 7.6%.
‘B’ group had 421 unique visitors, and 29 conversions. It has a conversion rate of 29/421 which is 6.9%.
Was there a significant difference? This will be covered in the confidence calculations articles.
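As a quick sketch of the arithmetic in the example (Python purely for illustration; the function name is my own invention), the per-group conversion rate could be worked out like this:

```python
def conversion_rate(conversions: int, unique_visitors: int) -> float:
    """Conversions divided by unique visitors, returned as a fraction."""
    if unique_visitors <= 0:
        raise ValueError("need at least one unique visitor")
    return conversions / unique_visitors


# (conversions, unique visitors) for each group, taken from the example.
groups = {"A": (34, 445), "B": (29, 421)}

for name, (conversions, visitors) in groups.items():
    rate = conversion_rate(conversions, visitors)
    print(f"{name}: {rate:.1%}")
```

This only gives the raw rates. Whether the gap between them means anything is the confidence question, which is a separate step.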
The sample size required to get confidence is less about people than it is about conversions. We aimed to get 10k conversions per group to obtain significant results. Of course, if the test is a bigger dial mover then it will happen much quicker. The further down the funnel your test page is, the quicker you will get your results. If the test is on your landing page where you convert, then 100% of the visitors must hit this page to convert. If it's on a side information page, then not every visitor has to hit it to convert, so it will take longer to get results, as you must only count the people who hit this page and then did or did not convert.
The key part of data collection is that you divide your users equally between the groups. This can be done using IP, a cookie or at the server level. It is of utmost importance to make sure that your allocation method has these properties:
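To make the "only count the people that hit this page" rule concrete, here is a minimal sketch. The record layout and the sample data are made up for illustration, and it assumes one record per unique visitor:

```python
from collections import defaultdict

# Hypothetical records: (visitor_id, group, saw_test_page, converted).
visits = [
    ("v1", "A", True, True),
    ("v2", "A", True, False),
    ("v3", "A", False, True),   # converted without ever seeing the test page
    ("v4", "B", True, False),
    ("v5", "B", True, True),
    ("v6", "B", False, False),
]


def rates_for_exposed(records):
    """Conversion rate per group, counting only visitors who saw the test page."""
    seen = defaultdict(int)
    converted = defaultdict(int)
    for _visitor_id, group, saw_test_page, did_convert in records:
        if not saw_test_page:
            continue  # never exposed to the test, so left out of both counts
        seen[group] += 1
        if did_convert:
            converted[group] += 1
    return {group: converted[group] / seen[group] for group in seen}


print(rates_for_exposed(visits))
```

Visitor v3 converted but never saw the test page, so that conversion is left out of the A group's figures.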
- Repeatable at the visitor level, so every time they revisit the site during the period of testing they will see the same test version.
- Allocates evenly between groups.
We found the cookie-based allocation method was best of all, because the other two have problems:
- IP allocation: IPs in parts of the world cluster around large proxies, so a lot of users from certain countries get placed in the same group (A or B, or C or D etc), and as users from different countries generally convert at reliably different levels, this puts a bias on the test. Certain regions, such as Europe, don't suffer from this problem as much.
- Traffic diverting (allocation at the server level): this depends on the servers' performance. Not all servers perform equally, and actually managing the allocation is not straightforward as it involves dealing with load balancers etc.
Cookie-based allocation turned out to be beautiful. Ours was based on ASP.NET, where the cookie ID is actually a GUID (globally unique identifier). I did some analysis on generated GUIDs and they turned out to be almost perfectly random. With a test program I had written that produced a million new GUIDs per group, each group came out within less than a hundred of the others. However, the really, really cool thing about GUIDs is that they are hexadecimal, and contain 32 hex characters. Like this:
28121de6-85cc-4aef-acbc-19c2e5cb57d3
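If you want to repeat the evenness check yourself, something along these lines would do. This is not the original test program: it uses Python's `uuid.uuid4()` as a stand-in for the ASP.NET cookie GUIDs, and the A/B split on a single hex character is my assumption.

```python
import uuid
from collections import Counter


def split_counts(n: int, position: int = 0) -> Counter:
    """Allocate n fresh GUIDs to A or B by one hex character and count each group."""
    counts = Counter()
    for _ in range(n):
        hex_chars = uuid.uuid4().hex  # 32 hex characters, hyphens already removed
        digit = int(hex_chars[position], 16)
        counts["A" if digit < 8 else "B"] += 1
    return counts


counts = split_counts(100_000)
print(counts, "difference:", abs(counts["A"] - counts["B"]))
```

One caveat for version 4 GUIDs like the one above: the 13th hex character is always 4 and the 17th is always one of 8, 9, a or b, so those two positions cannot split traffic evenly and are best avoided.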
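Why is this so good? Well, you want to have more than one test running at once. You also want the allocation to be repeatable per person, and you want to be able to have more than just AB. Maybe ABCD or ABCDEFG, and you are verging on multivariate possibilities (each hex character has 16 values, which is 2 x 2 x 2 x 2 or 2 to the power 4, so that's 4 two-way multivariate variables you could run simultaneously).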
So somewhere we have a matrix that says for this test ID, which slots belong to which groups.
  0123456789ABCDEF
A xxxxxxxx
B         xxxxxxxx
C
D
E.. etc
So say the test ID is 21. We do a modulus 32 on this (this divides by 32 and returns the remainder), which in this case gives us 21, as 21 is less than 32.
If we then take the 21st character of the GUID (ignoring the hyphens and counting from 1), we get ‘1’. The 1 then goes into our lookup table and tells us this is an A user. That's it! Then all the presentation tier has to do is show the A version for that user.
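Here is a minimal sketch of that lookup in Python. It is an illustration under my own assumptions, not the production code: the names `build_table`, `TEST_MATRIX` and `allocate` are hypothetical, test 22 is invented to show a four-way split, and a remainder of 0 is treated as the 32nd character so that counting from 1 still works.

```python
HEX_DIGITS = "0123456789abcdef"


def build_table(groups):
    """Split the 16 hex digits into equal contiguous blocks, one block per group."""
    if 16 % len(groups) != 0:
        raise ValueError("this sketch only handles group counts that divide 16")
    block = 16 // len(groups)
    return {digit: groups[i // block] for i, digit in enumerate(HEX_DIGITS)}


# Hypothetical matrix: test ID -> (hex digit -> group).
TEST_MATRIX = {
    21: build_table(["A", "B"]),            # 0-7 -> A, 8-f -> B
    22: build_table(["A", "B", "C", "D"]),  # four hex digits per group
}


def allocate(guid: str, test_id: int) -> str:
    """Return the group this visitor's GUID falls into for the given test."""
    hex_chars = guid.replace("-", "").lower()
    if len(hex_chars) != 32:
        raise ValueError("expected a GUID with 32 hex characters")
    # Assumption: characters are counted from 1, so test ID 21 reads the 21st
    # character; a remainder of 0 (test ID 32, 64, ...) wraps to the 32nd.
    position = (test_id - 1) % 32
    return TEST_MATRIX[test_id][hex_chars[position]]


guid = "28121de6-85cc-4aef-acbc-19c2e5cb57d3"
print(allocate(guid, 21))
print(allocate(guid, 22))
```

Because the group is derived from the cookie GUID alone, nothing has to be stored per visitor: the same GUID and test ID always give the same answer, and different test IDs read different characters, so several tests can run side by side.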
That's all for now, my hands are tired. We need to cover a lot more stuff, such as analysis, the presentation tier, and tons of other problems in this and other areas. Go wild with the comments if you have questions; I will do my best to answer them all.