The subtle differences between running experiments in the online and physical world

JULY 27, 2018
Neeraj Hirani

In continuation of our series of posts on the nuances of experimentation, we felt a good starting point would be to highlight the differences between running experiments in the online and the offline world. A/B testing is quite popular in the online world, with leading companies like Amazon, Google, and Microsoft running many experiments in parallel each day. What is less well known, however, is that the practice is also widespread in the offline world. Many banks, retailers, and restaurants try out ideas in a small set of locations or customer groups to measure the impact of interventions before rolling out large-scale changes. Below, we highlight fundamental differences between experimentation in these two worlds.

We have organized these ideas into three main categories:

  1. Data
  2. Experiments
  3. Measurements and Rollout


Data Collection

In the online world, since the focus is typically on optimizing conversions of a website, data collection is fairly easy. Often, a single team works on the collection process, making it centralized and cheap. The process typically captures information about site performance, visitors' interactions, devices, and activities.

Data in the physical world is scattered across multiple geographies, and its variety makes collection and integration harder.

Data Homogeneity

Almost all experiments in the online world use data related to site-performance, devices, browsers, visitor behavior and interactions in addition to customer and transaction data.

In the physical world, experiments are not restricted to one type of optimization, so a much wider variety of data is needed. Depending on the type of experiment, one can see a wide range of internal and third-party data in use, such as demographic profiles, store master data, historical sales volume, seasonality, and weather data. New sources of data are also introduced quite often, for example video feeds used to analyze how billing and checkout times can be shortened.

Data Quantity

Websites that run experiments typically have millions of visitors each day. With such huge numbers, one can pick up on subtle behavior changes or get answers much more quickly.

However, in the physical world, with sparse data available in experiments, one can only catch the most drastic effects. Many mid-sized retailers do not have more than 150-200 stores in total. For them, a realistic selection of test stores may be no more than a dozen. This often leads to challenges around matching test and control stores. To obtain statistically significant findings, experimenters often have to make assumptions and work with aggregated data. At other times, they use proxy data to counter the non-availability of data. When testing with low data volumes, the focus is on making extreme changes and finding lightweight proxies for larger changes.

Data Quality

In the world of online experimentation, the quality of data is often perceived to be superior to that of the offline world, because less effort is needed for data cleansing. Data is usually collected centrally (from a website), which offers greater control over capturing it in consistent formats.

Data in the brick-and-mortar world is scattered across multiple locations, and integrating it may require considerable effort. We have often seen data stored in inconsistent formats across geographies. For example, date fields might be stored as DD-MM-YYYY in some store locations and as MM-DD-YY in others.
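The date-format example above can be handled with a small normalization step. The following is a minimal Python sketch, not a description of any particular tool: the store identifiers and the STORE_DATE_FORMATS mapping are hypothetical, and it assumes each store's format is known in advance.

```python
from datetime import date, datetime

# Hypothetical mapping from store ID to the date format that store uses.
STORE_DATE_FORMATS = {
    "store_a": "%d-%m-%Y",  # DD-MM-YYYY
    "store_b": "%m-%d-%y",  # MM-DD-YY
}


def normalize_date(store_id: str, raw: str) -> date:
    """Parse a raw date string using the format registered for its store."""
    fmt = STORE_DATE_FORMATS[store_id]  # KeyError flags an unregistered store
    return datetime.strptime(raw.strip(), fmt).date()


if __name__ == "__main__":
    # Both stores record the same day in different formats.
    print(normalize_date("store_a", "27-07-2018").isoformat())
    print(normalize_date("store_b", "07-27-18").isoformat())
```

The format is looked up per store rather than guessed from the string itself, because a value such as 03-04-18 is ambiguous between the two conventions.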



Most A/B tests online pertain to improving conversions (turning visitors into clicks, sign-ups, or purchases) and feed into decisions that affect the UX of a site. Examples include testing page headlines, banner colors and sizes, text on buttons, page layouts, and the number of steps in a checkout process, to see which version leads customers to perform the desired action, whether an opt-in click, a sign-up, or a purchase.

However, the physical world sees a far greater variety in types of experiments, ranging from areas as diverse as promotional offerings, packaging, and pricing to store-operation hours, store layouts, staff training, inventory, and supply chain. The scope and variety of experiments that can be run in the physical world across functions are large and the possibilities are endless.



The world of web experimentation is conducive to running several experiments in parallel through quick, iterative cycles of observation, hypothesis, and testing at extremely low cost, because site visitors can be assessed in near real time.

The time taken to plan, gather resources and execute an experiment is often longer in the physical world. However, with new age tools to drive an experimentation agenda, such challenges can be circumvented.


With web experimentation, since the focus is almost always on the company website, the experimentation teams are often centralized. This makes implementation of web experiments relatively easy to do, mostly through tools which can easily create variants just by adding a few lines of code.
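As an illustration of how little code a web variant can need, here is a minimal Python sketch of deterministic variant assignment by hashing. It is one common approach and not the method of any specific testing tool; the function name, the experiment label, and the variant names are assumptions made for this example.

```python
import hashlib


def assign_variant(visitor_id: str, experiment: str,
                   variants=("control", "treatment")) -> str:
    """Map a visitor to a variant, stably across page loads.

    The experiment name is mixed into the hash so that the same visitor
    can land in different groups for different experiments.
    """
    key = f"{experiment}:{visitor_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Use the first 8 bytes of the digest as an integer bucket number.
    bucket = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[bucket]


if __name__ == "__main__":
    # Hypothetical visitor IDs and experiment label.
    for visitor in ("visitor-1", "visitor-2", "visitor-3"):
        print(visitor, assign_variant(visitor, "checkout-button-text"))
```

Because the assignment depends only on the visitor ID and the experiment name, no lookup table has to be stored to keep a returning visitor in the same group.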

Implementation difficulties often arise in the physical world because stores are spread out across geographies and tests rely on other stakeholders to administer them. This creates the need for coordination across multiple locations and multiple teams. For example, in pricing experiments run by the marketing team, the price labels in stores would need to be manually changed by store operations staff to reflect the test prices.

Sometimes the tests entail investments which take time. For example, in designing variations in store layouts, approvals and investments need to be secured. This often leads to long delays before the experiment can even begin.



Running multiple experiments in parallel is fairly common practice in the online world. Companies like Microsoft (MSN), Google, and Amazon are known for running a very large number of tests, with many of them live on a single page on any given day.

Executing multiple experiments simultaneously is harder in the physical world. Often, the number of overall locations available is small, making the sample of stores where tests will be run, even smaller. It is harder still to find like-for-like matched control stores to those test stores in order to run a fair comparison. To complicate things further, when multiple experiments are run in parallel, it becomes very tricky to isolate the impact of different interventions, especially when metrics being tracked may be linked.
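One simple way to look for like-for-like control stores is nearest-neighbor matching on a few store attributes. The Python sketch below is a rough illustration under stated assumptions: the store IDs and the two features (average weekly sales and floor area) are made up, and greedy matching on standardized Euclidean distance is only one of several possible matching schemes.

```python
import math
from statistics import mean, pstdev


def standardize(stores):
    """Rescale each feature to zero mean and unit variance across stores."""
    ids = list(stores)
    n_features = len(stores[ids[0]])
    columns = [[stores[s][j] for s in ids] for j in range(n_features)]
    mus = [mean(col) for col in columns]
    sigmas = [pstdev(col) or 1.0 for col in columns]  # guard constant columns
    return {
        s: tuple((stores[s][j] - mus[j]) / sigmas[j] for j in range(n_features))
        for s in ids
    }


def match_controls(stores, test_ids):
    """Greedily pair each test store with its closest unused non-test store."""
    z = standardize(stores)
    available = set(stores) - set(test_ids)
    if len(available) < len(test_ids):
        raise ValueError("not enough candidate control stores")
    matches = {}
    for t in test_ids:
        # sorted() makes tie-breaking deterministic.
        best = min(sorted(available), key=lambda c: math.dist(z[t], z[c]))
        matches[t] = best
        available.remove(best)  # each control store is used at most once
    return matches


if __name__ == "__main__":
    # Hypothetical stores: (average weekly sales, floor area)
    stores = {
        "S1": (100.0, 50.0), "S2": (102.0, 52.0), "S3": (300.0, 120.0),
        "S4": (295.0, 118.0), "S5": (200.0, 80.0),
    }
    print(match_controls(stores, ["S1", "S3"]))
```

With only a handful of candidate stores, the closest available match may still be quite different from the test store, which is the difficulty described above.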

Course Corrections

In web experimentation, it is often easier to make changes midway or to stop active experiments that are running because of instantaneous feedback.

In the physical world, however, making changes midway through experiments requires significant communication overhead.

Measurement & Rollout

Obtaining statistically significant results

Ideally, a large sample reduces the variance of the results. A sample that faithfully represents the population (all stores or all customers) also gives experimenters greater confidence in the findings.

In theory, with sufficiently large volumes of data and the right sample sizes in web experimentation, it is often straightforward to observe statistically significant results.

In the physical world, however, with small sample sizes, it is often difficult to obtain statistically significant results.
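The effect of sample size can be made concrete with the standard minimum detectable effect formula for comparing two group means. The Python sketch below assumes a two-sided test, equal group sizes, a known standard deviation, and a normal approximation (which is optimistic for very small samples); the function name and the example group sizes are illustrative only.

```python
from statistics import NormalDist


def minimum_detectable_effect(sigma: float, n_per_group: int,
                              alpha: float = 0.05, power: float = 0.8) -> float:
    """Smallest true difference in means detectable at the given power.

    MDE = (z_{1 - alpha/2} + z_{power}) * sigma * sqrt(2 / n_per_group)
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return (z_alpha + z_power) * sigma * (2 / n_per_group) ** 0.5


if __name__ == "__main__":
    # Effect sizes are expressed in units of the metric's standard deviation.
    for n in (12, 1_000_000):  # a dozen stores vs. a million visitors per group
        print(n, round(minimum_detectable_effect(1.0, n), 4))
```

Under these assumptions, a dozen stores per group can only reveal effects larger than about one standard deviation of the metric, while a million visitors per group can reveal effects hundreds of times smaller.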

Implementing recommendations

In the web world, most test results lead to site improvements that can be made by altering a few lines of code. Depending on the magnitude of the experiment, even massive site redesigns are possible.

However, implementing changes in the offline world is particularly challenging; re-configuring the arrangement of shelves in a store is much harder and costlier than changing some CSS on a website. It requires navigating complex organizational structures to obtain stakeholder buy-in.

In the physical world, retailers typically opt for targeted roll-outs of winners, concentrating investments on specific customers, markets, and segments where the potential for returns is the highest.


Although experimentation in the physical world can appear hard, its returns far outweigh the costs. The overwhelming strength of field experiments lies in the external validity they provide in a cost-effective manner. The marginal costs of experimentation are tiny compared to the marginal benefits it produces. The more experiments an organization runs, the faster it can learn by failing fast and failing cheaply. Scaling the winning ideas and learning from the ones that fail will be key drivers of innovation and competitiveness for all organizations.

If you are looking to get started on your test and learn journey, we at Impact Analytics are adopting the latest developments in advanced analytics, AI, and ML to bring experimentation to the offline world. To learn more, reach out to us or schedule a demo with us today.