Research evaluation upside down!


The MiFID II regulations that come into force at the beginning of 2018 give investment research consumers a stark choice: Determine whether the research that they purchase is substantive, or risk breaching the rules.

Perhaps this requirement for robust evaluation is a good thing. Research will become a commodity in its own right, to be bought and sold just like any other commercial product. It should therefore be subject to rigorous assessment.

The problem is not just how to evaluate, but what to evaluate and how to use the resulting data. This article looks at what form a new evaluation methodology should take and how best to use the data it generates. A more in-depth look at evaluation is available here.

Why evaluate?

Compliance: MiFID II obligates the buy-side to track and justify their research purchase decisions.
The key points of the new rules are:

  • Firms must formulate a clear methodology that establishes how they are to pay providers for research before they receive and consume services. This should include setting measurable ex-ante criteria as to how it will value the types, level and quality of service.

  • Firms should have agreements with research providers prior to receiving substantive services.

  • Firms can negotiate research prices ex-ante with suppliers, however, any ex-post variation in payments made to the research firm based on actual services received should be made in a proportionate and predictable manner based on the measurable (ex-ante) criteria.

  • Firms must evaluate services using qualitative and quantitative frameworks, consistently across providers.

  • Regular assessments of research consumption should be conducted to evaluate future procurement decisions and research payment levels.

In the past, regulators have often given an unofficial grace period to allow affected firms to catch up, and this is likely to be the case again. Nonetheless, the buy-side will have to show that they have documented procedures and proper records pretty much from the outset.

Procurement: A robust procurement process is always best-practice and an objective evaluation methodology should be part of that process.

Evaluation should take place at several points in the procurement cycle:

Product Requirement Evaluation

Firstly, users’ requirements should be gathered and used to develop a product specification. Pre-unbundling, providers were often selected for their execution services rather than their research, and research providers that did not offer execution were sometimes excluded. Unbundling will enable the buy-side to consider research from a much larger range of suppliers and should help them to match products to their requirements more accurately.

This primary evaluation should be revisited on a regular basis to ensure that there is no specification creep, especially in terms of regulatory obligations.

Product Performance Evaluation

The purchased research should be reviewed on a regular basis. This evaluation should be multifaceted, not only grading quality in terms of user satisfaction and adherence to the product specification, but also enabling comparison of similar products in the marketplace. This can only be achieved if evaluation data is shared: Today’s post-purchase appraisal should become part of tomorrow’s pre-purchase discovery.

How to evaluate

A new method is needed. The much-denigrated Broker Vote system’s limitations are well documented: lack of granularity, bias against new or niche players, a tendency to favor large institutions, and so on. MiFID II signals its demise, as the Broker Vote’s core principle is that ratings set price after consumption. MiFID II, however, requires prices to be set beforehand, forcing the separation of evaluation from cost. Any new evaluation methodology will therefore need to focus solely on rating the quality of research products.

The new system must be flexible, transparent, objective and easily adopted. To find it, the industry will have to look outside financial services for inspiration.

  • Flexibility is a prerequisite as the model will need to cope with a range of different formats – whilst the majority of research is in the form of written reports, research in other forms such as meetings, phone conversations, and video are significant (and perhaps substantive) enough to require evaluation. Adopting a standard methodology will be beneficial to all research stakeholders – consumers will use ratings as a guide for purchase decisions, and producers will adapt their offerings according to the feedback received.

  • Transparency is necessary to fulfill regulatory requirements and to give credibility. It is, however, difficult to achieve if an overall, at-a-glance grade is the first visible score. To solve this problem, primary and secondary grades should be layered so that users can drill down through the primary grade to the component labels and ratings. The number of ratings should be clearly visible, as should the categorization of the product.

  • Objectivity will be achieved once there is sufficient evaluation data to ensure that outlying scores do not cause distortion. This is only possible if the rating data is shared and consolidated.

  • Easy Adoption. Several major firms have rating models that are more complex than a simple five-star system, yet have become widely used by diverse groups of customers. That said, an evaluation should not take more than a couple of minutes, and the rating system should be simple, clear and easy-to-use.

Global companies like eBay, Uber and Airbnb are successful because of their rating systems. In general, their complexity is proportional to the cost of the product. Uber uses a rudimentary five-star grading with optional labels, whereas eBay and Airbnb are more sophisticated. A simple system such as Uber’s is adequate given the low price of the service in absolute terms. eBay and Airbnb use more complex systems combining a primary, overall rating with secondary ratings on specific aspects of the services or goods provided.

It seems that full-service global coverage from a bulge-bracket bank may cost up to $10,000 per user per year. Given the significant cost, it is clear that a one-dimensional grading model is not suitable for research. A derivative of the model used by eBay and Airbnb could form a sound basis for rating research, as the multiple layers will enable consumers to grade the various criteria that will form part of such an evaluation method, including a “substantivity” test as implicitly required under MiFID II.

A problem with many grading systems is that scores eventually become inflated – drivers with Uber ratings below 4.6, and private sellers on eBay with ratings below 98.4, struggle to do business. As there is no vetting prior to admission – anyone can offer their services or products – the rating system itself filters out weaker providers over time. This is inefficient because the range of scores becomes so narrow that it is impossible to distinguish the great from the good. This may not matter when choosing a taxi, as the prices from different drivers will be the same; it becomes very important, however, when two offerings of a similar product (e.g. European equity research) carry a large price difference.

Ideally, consumers will be able to purchase research from a trusted marketplace where all the offerings have met certain standards prior to their inclusion. This will both save time and ensure that the scores are graduated enough to identify differences. When combined with price and precise coverage information, research buyers will have real choices based on a large number of objective ratings – participation in most commercial grading models is optional, not so for research.

What to evaluate

Another issue is deciding what should be evaluated. Some discussions have concluded that evaluations should only be made at the analyst or even firm level. This is a mistake, as such an approach will have little granularity and will not cope with the huge changes to the industry that are likely to follow MiFID II. As research becomes a significant cost, consumers will demand more choice. For example, socially-conscious or “green” funds will only want to buy research on equities that match their eligibility criteria. Other factors such as geography, sector and size may not be relevant. Unless individual research reports have been evaluated, it will be nigh-on impossible for the green fund managers to select the appropriate research using measurable and objective criteria.

Analyst- and firm-level ratings can be extrapolated from the ratings given to individual units of research, provided that firms perform evaluations using a standardized methodology and agree to share the results – doing so on a strictly anonymous basis makes the most sense.
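The roll-up from individual research ratings to analyst and firm level can be sketched in a few lines. This is a minimal illustration; the record layout, names and scores are hypothetical, not a real data feed or published schema.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-report evaluations: (firm, analyst, rating 1-5).
reports = [
    ("Firm A", "Analyst X", 4),
    ("Firm A", "Analyst X", 5),
    ("Firm A", "Analyst Y", 3),
    ("Firm B", "Analyst Z", 2),
    ("Firm B", "Analyst Z", 4),
]

def rollup(reports):
    """Extrapolate analyst- and firm-level scores from report-level ratings."""
    by_analyst, by_firm = defaultdict(list), defaultdict(list)
    for firm, analyst, rating in reports:
        by_analyst[(firm, analyst)].append(rating)
        by_firm[firm].append(rating)
    analyst_scores = {k: mean(v) for k, v in by_analyst.items()}
    firm_scores = {k: mean(v) for k, v in by_firm.items()}
    return analyst_scores, firm_scores

analyst_scores, firm_scores = rollup(reports)
print(firm_scores)  # Firm A averages 4.0, Firm B averages 3.0
```

The point is that no separate analyst or firm vote is needed: once report-level ratings are shared in a common format, higher-level scores fall out of a simple aggregation.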


The four tests for “substantivity” set out in the FCA handbook (COBS 11.6.3R(3)(c)(ii)) are eminently sensible and should be at the heart of any evaluation method. They are:

  1. Add value to investment decisions via new insights

  2. Represent original thought, not repeat what has been said before

  3. Have intellectual rigor and not state the obvious

  4. Have meaningful conclusions based on analysis

The easiest way to incorporate them is to use a model that combines labels and ratings. Labels work by enabling the evaluator to rapidly provide feedback that is richer than a grade on a linear scale. Such labels are grouped together as responses to a question. Ratings, on the other hand, provide criteria of common interest to most potential assessors that are graded on a linear scale (typically 1-5 stars).
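A labels-plus-ratings evaluation record might look like the sketch below, with the four FCA substantivity tests serving as the label group. The field names, the pass threshold and the rating criteria are illustrative assumptions, not a published schema.

```python
# One evaluation record combining labels (tick-box answers to a question)
# with ratings (common criteria graded on a 1-5 linear scale).
SUBSTANTIVITY_LABELS = {
    "adds_value": "Adds value to investment decisions via new insights",
    "original": "Represents original thought",
    "rigorous": "Has intellectual rigor, does not state the obvious",
    "meaningful": "Has meaningful conclusions based on analysis",
}

evaluation = {
    "report_id": "RPT-001",           # hypothetical identifier
    "labels": {"adds_value", "rigorous"},  # tests the evaluator ticked
    "ratings": {"clarity": 4, "timeliness": 3, "depth": 5},
}

def is_substantive(evaluation, threshold=3):
    """Assumed rule: a report passes if at least `threshold` of the
    four substantivity tests are ticked."""
    return len(evaluation["labels"] & set(SUBSTANTIVITY_LABELS)) >= threshold

print(is_substantive(evaluation))  # two of four ticked -> False
```

Labels capture the richer, question-driven feedback; the linear ratings remain comparable across products and evaluators.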

Is it possible to aggregate a series of separate rating criteria into one meaningful score?

Airbnb and eBay have side-stepped this issue by using only the primary grade in their headline scores. Both, however, allow the user to drill down to find the aggregated results of each category score. This is perhaps the best solution to the requirement for easily comparable, at-a-glance results. It creates a new problem, however: the master rating may bear little or no relation to the component ratings, which would be inconsistent and possibly misleading. Although it is problematic to produce a master score from a subset of varied components, it is possible to build a sanity check based on them, as a check does not need to be as accurate as a score. For example, suppose the primary rating is a nine-star grade (a wider scale gives finer granularity). Should an evaluator choose a positive score (e.g. seven stars) having given negative grades and labels in the components, the sanity check might show a range based upon those negative scores (e.g. two – five stars) and suggest the evaluator reconsider.
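Such a sanity check could be implemented roughly as follows: rescale the component averages onto the primary scale, widen the result into a band, and prompt the evaluator if the primary grade falls outside it. The rescaling formula and band width are assumptions for illustration, not a prescribed method.

```python
def sanity_range(component_scores, primary_max=9, component_max=5):
    """Project 1-5 component ratings onto the 1-9 primary scale and
    return a plausible (low, high) band for the primary grade."""
    avg = sum(component_scores) / len(component_scores)
    # Linear rescale 1..5 -> 1..9, then widen by +/- 1.5 stars
    # to leave room for evaluator judgment (band width is assumed).
    centre = 1 + (avg - 1) * (primary_max - 1) / (component_max - 1)
    low = max(1, round(centre - 1.5))
    high = min(primary_max, round(centre + 1.5))
    return low, high

def check(primary, component_scores):
    low, high = sanity_range(component_scores)
    if not low <= primary <= high:
        return f"Primary {primary} outside suggested range {low}-{high}; reconsider?"
    return "OK"

# Negative components (average 2.0 out of 5) map to roughly three stars
# on the nine-star scale, so a seven-star primary triggers the prompt.
print(check(7, [2, 1, 3]))
```

The check never blocks the evaluator; it merely surfaces the implied range so an inconsistent primary grade is a deliberate choice rather than an oversight.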

The research industry is going through radical change, and this will accelerate post-MiFID II. A new evaluation methodology is needed, as the old methods are no longer fit for purpose. The inspiration for the new system will most likely come from outside the financial services industry. The cornerstones will be flexibility, transparency, objectivity and ease of adoption.

The new model will have a variety of applications. Initially, ratings will be made post-purchase, primarily for regulatory reasons. As the amount of data grows, and provided results are shared, it will become an important tool in pre-purchase decision making. It will also be used to measure performance internally, and to help solve the old make-or-buy problem, as many buy-side firms expand their own research departments to reduce costs.

A new method for the assessment of research is urgently required, and Alphametry's data-driven technology places objectivity at the heart of its evaluation model. Given the significant cost of research, isn’t it time for evaluation to be meaningful?

Ian Spittlehouse

A wholesale markets expert (Eurex, Citi) with over 25 years of industry experience, Ian currently heads Alphametry in the UK.