Factor selection is among our most important considerations when building financial models. So, as machine learning (ML) and data science become ever more integrated into finance, which factors should we consider for our ML-driven investment models and how should we select among them?

These are open and critical questions. After all, ML models can help not only in factor processing but also in factor discovery and creation.

### Factors in Traditional Statistical and ML Models: The (Very) Basics

Factor selection in machine learning is called “feature selection.” Factors and features help explain a target variable’s behavior, while investment factor models describe the primary drivers of portfolio behavior.

Perhaps the simplest of the many factor model construction methods is ordinary least squares (OLS) regression, in which the portfolio return is the dependent variable and the risk factors are the independent variables. As long as the independent variables have sufficiently low correlation, different models will be statistically valid and explain portfolio behavior to varying degrees. Each model reveals what percentage of the variation in a portfolio’s return it explains, as measured by R-squared, and how sensitive the portfolio’s return is to each factor, as expressed by the beta coefficient attached to that factor.
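
To make the OLS setup concrete, here is a minimal sketch on synthetic data. The factor labels, the "true" betas, and the noise levels are all assumptions chosen for illustration, not real market data. It estimates the beta coefficients by least squares and computes R-squared by hand.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic monthly data (illustrative only, not real returns).
n_months = 120
factor_names = ["market", "size", "value"]  # hypothetical factor labels
factors = rng.normal(0.0, 0.04, size=(n_months, 3))
true_betas = np.array([1.1, 0.3, -0.2])     # assumed sensitivities
noise = rng.normal(0.0, 0.01, size=n_months)
portfolio = 0.002 + factors @ true_betas + noise

# Add an intercept column and solve the least-squares problem.
X = np.column_stack([np.ones(n_months), factors])
coef, *_ = np.linalg.lstsq(X, portfolio, rcond=None)
alpha, betas = coef[0], coef[1:]

# R-squared: share of the return variance explained by the factors.
fitted = X @ coef
ss_res = np.sum((portfolio - fitted) ** 2)
ss_tot = np.sum((portfolio - portfolio.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

for name, b in zip(factor_names, betas):
    print(f"beta[{name}] = {b:.3f}")
print(f"alpha = {alpha:.4f}, R^2 = {r_squared:.3f}")
```

With enough observations and weakly correlated factors, the estimated betas should land close to the assumed ones, and R-squared summarizes how much of the portfolio’s variation the model accounts for.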

Like their traditional statistical counterparts, ML regression models also describe a variable’s sensitivity to one or more explanatory variables. ML models, however, can often better account for non-linear behavior and interaction effects than their non-ML peers, and they generally do not provide direct analogs of OLS regression output, such as beta coefficients.

### Why Factors Should Be Economically Meaningful

Although synthetic factors are popular, economically intuitive and empirically validated factors have advantages over such “statistical” factors, high-frequency trading (HFT) and other special cases notwithstanding. Most of us as researchers prefer the simplest possible model. As such, we often begin with OLS regression or something similar, obtain convincing results, and then perhaps move on to a more sophisticated ML model.

But in traditional regressions, the factors must be sufficiently distinct, or not highly correlated, to avoid the problem of multicollinearity, which can undermine a traditional regression. Multicollinearity means that two or more of a model’s explanatory factors are so similar that the regression cannot reliably separate their individual effects, which makes the coefficient estimates unstable and hard to interpret. So, in a traditional regression, lower factor correlation (that is, avoiding multicollinearity) means the factors are probably economically distinct.
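
One common way to screen for multicollinearity is the variance inflation factor (VIF), which regresses each factor on all the others and reports 1 / (1 - R²). The sketch below is a minimal illustration on synthetic data; the helper name `variance_inflation_factors` and the three made-up factors are assumptions, and the often-quoted cutoffs of 5 or 10 are rules of thumb rather than hard limits.

```python
import numpy as np

def variance_inflation_factors(X):
    """Return the VIF of each column of X (rows = observations)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    vifs = np.empty(k)
    for j in range(k):
        y = X[:, j]
        # Regress factor j on an intercept plus all the other factors.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ coef
        r2 = 1.0 - resid.var() / y.var()
        vifs[j] = 1.0 / (1.0 - r2)
    return vifs

rng = np.random.default_rng(1)
n = 500
f1 = rng.normal(size=n)
f2 = rng.normal(size=n)                      # independent of f1
f3 = 0.95 * f1 + 0.05 * rng.normal(size=n)   # near-duplicate of f1

vifs = variance_inflation_factors(np.column_stack([f1, f2, f3]))
print(np.round(vifs, 1))
```

Here the first and third factors carry nearly the same information, so both show very large VIFs, while the independent second factor stays close to 1.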

But multicollinearity often does not apply in ML model construction the way it does in an OLS regression. This is so because, unlike OLS regression models, many ML model estimations do not require the inversion of a covariance matrix. Also, ML models generally do not have strict parametric assumptions or rely on homoskedasticity (constant error variance), independence of errors, or other time series assumptions.

Nevertheless, while ML models are relatively rule-free, a considerable amount of pre-model work may be required to ensure that a given model’s inputs have both investment relevance and economic coherence and are unique enough to produce practical results without any explanatory redundancies.

Although factor selection is essential to any factor model, it is especially critical when using ML-based methods. One way to select distinct but economically intuitive factors in the pre-model stage is to employ the least absolute shrinkage and selection operator (LASSO) technique. LASSO adds a penalty on the absolute size of the regression coefficients, which shrinks the coefficients of weak or redundant factors to exactly zero. This gives model builders the facility to distill a large set of factors into a smaller set that retains considerable explanatory power with less redundancy among the factors.
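
As a rough illustration of how LASSO prunes a candidate factor set, the sketch below implements a basic coordinate descent solver in NumPy and applies it to synthetic data in which only three of eight candidate factors actually matter. The function names, the penalty value `lam=0.05`, and the data are all assumptions made for this example; in practice, a library implementation with a cross-validated penalty would be the more usual choice.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z toward zero by t; values within [-t, t] become exactly 0."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimize (1/(2n)) * ||y - X b||^2 + lam * ||b||_1.

    Assumes the columns of X are standardized and y is centered.
    """
    n, k = X.shape
    beta = np.zeros(k)
    col_scale = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(k):
            # Residual with factor j's current contribution added back.
            partial_resid = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ partial_resid / n
            beta[j] = soft_threshold(rho, lam) / col_scale[j]
    return beta

rng = np.random.default_rng(7)
n_obs, n_factors = 500, 8
raw = rng.normal(size=(n_obs, n_factors))
X = (raw - raw.mean(axis=0)) / raw.std(axis=0)   # standardize factors

# Assumed ground truth: only factors 0, 3, and 5 drive the target.
true_coef = np.array([0.5, 0.0, 0.0, -0.3, 0.0, 0.2, 0.0, 0.0])
y = X @ true_coef + rng.normal(0.0, 0.1, size=n_obs)
y = y - y.mean()

beta = lasso_coordinate_descent(X, y, lam=0.05)
selected = [j for j in range(n_factors) if beta[j] != 0.0]
print("coefficients:", np.round(beta, 3))
print("selected factor indices:", selected)
```

The surviving coefficients are shrunk somewhat toward zero, which is the price of the penalty. A common follow-up is to refit an unpenalized model, or an ML model, on only the selected factors.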

Another fundamental reason to deploy economically meaningful factors: They have decades of research and empirical validation to back them up. The utility of Fama-French-Carhart factors, for example, is well documented, and researchers have studied them in OLS regressions and other models. Therefore, their application in ML-driven models is intuitive. In fact, in perhaps the first research paper to apply ML to equity factors, Chenwei Wu, Daniel Itano, Vyshaal Narayana, and I demonstrated that Fama-French-Carhart factors, in conjunction with two well-known ML frameworks (random forests and association rule learning), can indeed help explain asset returns and fashion successful investment trading models.

Finally, by deploying economically meaningful factors, we can better understand some types of ML outputs. For example, random forests and other ML models provide so-called relative feature importance values. These scores and ranks describe how much explanatory power each factor provides relative to the other factors in a model. These values are easier to grasp when the economic relationships among the model’s various factors are clearly delineated.
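
For a sense of what relative feature importance values look like, the following sketch fits scikit-learn’s `RandomForestRegressor` to synthetic data and reads its `feature_importances_` attribute. The factor labels and the return-generating formula are assumptions invented for this example; they are not estimates of how real factors behave.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
n = 1000
names = ["market", "size", "value", "momentum"]  # hypothetical labels
X = rng.normal(size=(n, 4))

# Assumed data-generating process: a strong market effect, a
# size-by-value interaction, and a weak momentum effect.
y = (1.0 * X[:, 0]
     + 0.5 * X[:, 1] * X[:, 2]
     + 0.1 * X[:, 3]
     + rng.normal(0.0, 0.1, size=n))

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X, y)

# Impurity-based importances are non-negative and sum to 1.
importances = model.feature_importances_
ranking = sorted(zip(names, importances), key=lambda p: p[1], reverse=True)
for name, score in ranking:
    print(f"{name:<10s} {score:.3f}")
```

The scores are relative: they rank the factors against one another within this model but say nothing about the direction of each effect, which is one reason economically interpretable inputs make the output easier to read.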

### Conclusion

Much of the appeal of ML models rests on their relatively rule-free nature and how well they accommodate different inputs and heuristics. Nevertheless, some rules of the road should guide how we apply these models. By relying on economically meaningful factors, we can make our ML-driven investment frameworks more understandable and ensure that only the most complete and instructive models inform our investment process.

*All posts are the opinion of the author. As such, they should not be construed as investment advice, nor do the opinions expressed necessarily reflect the views of CFA Institute or the author’s employer.*
