
The Gist of Reinforcement Learning (and Why We Should Care)

FEDERICO CORNALBA

02 December 2021

The term ‘learning’ has become increasingly prevalent in our everyday lives. It is fair to say that a good portion of this trend can be attributed to the phenomenal rise of Machine Learning, a branch of Artificial Intelligence that has gained a prominent position in a wide range of quantitative fields, including Computer Vision (CV), Autonomous Driving, and Natural Language Processing (NLP), to name just a few.

While several Machine Learning problems can effectively be solved in a 'static' way (i.e., the learning process analyses all available data in one go), many other problems are more naturally framed in a 'dynamic' fashion (i.e., the learning process is split into consecutive steps, where each step is associated with an interpretable, meaningful action that either the algorithm or the algorithm's user can take within the scope of the given problem). These dynamically structured problems make up a Machine Learning sub-field commonly referred to as ‘Reinforcement Learning’.

Reinforcement Learning is fascinating from a conceptual point of view. But it also allows us to describe the intrinsically dynamic objects that interest us most here at Trality, namely financial markets, which is why we are kicking off a blog series on the topic. In this first piece, we'll explore the basic idea of Reinforcement Learning before explaining its potential connection with Trality's core business. Drawing our inspiration from the beautifully written report “Reinforcement Learning in Financial Markets - a survey” by T. G. Fischer, we won't even need to refer to finance-related examples for much of this piece. As a matter of fact, we'll start off rather... basic.

Reinforcement Learning: A Basic Example

For the time being, we'll link all of our considerations to the following example.

Basic example: Assume you have been invited by a host to try out a new, multi-step game, which essentially involves making strategic decisions at each step, with the goal of maximising some kind of reward at the end of the game.

You'd be hard-pressed to find a more generic description, but, for the time being, that's basically all you need to know. Don’t worry, we will provide more specific instances of such a game whenever needed.

Depending on how the host decides to set things up, you might find yourself in one of two different but equally realistic scenarios.

Scenario 1. As you have never played before, the host decides to pair you with an experienced player who has played many times before. This experienced player is able to summarise the game’s overall environment in a set of features (its ‘state’) and to predict how the state is likely to change from step to step. Your involvement is limited to deciding which ‘action’ to take at each step of the game, based on the predictions you receive from the experienced player.

Scenario 2. The host wants you to learn from your mistakes and therefore makes you play on your own. Since you are inexperienced, you are given several unofficial rounds in which to practise: at each step you consider the game’s current state and evaluate the actions available to you. You keep track of these evaluations and keep updating them throughout each round of the game.

Machine Learning

In Scenario 1, the experienced player has ‘learnt’ over the course of years that certain states of the game are likely to lead to some other specific states. In mathematical terms, the experienced player has seen many pairs (x_i, y_i), with x_i being a state and y_i being the observed state following x_i. This vast knowledge allows the experienced player to make predictions when confronted with a new configuration x.

In many cases, this learning task can also be performed effectively by machines (in a nutshell, this is Machine Learning), provided, of course, that said machines can figure out a suitable ‘mapping’ that faithfully links the data x to the observations y. More precisely:

Machine learning is concerned with having a machine select, among a very large (very large!) set of possible mappings, the one that most accurately matches the data x to the observations y.

The input provided by humans is usually limited to:

Data/observations
A definition of all possible mappings to explore
A way of assessing how each mapping performs at the prediction task

The machine does everything else: it uses a lot of computational power to explore the specified set of mappings and eventually picks out a suitable one. The computational power of modern computers is what makes this exploration, and hence the whole method, feasible.
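To make the three human-provided ingredients concrete, here is a minimal Python sketch. The data points are made up, the set of possible mappings is assumed to be the straight lines y = a * x for a grid of slopes a, and the way of assessing them is assumed to be the mean squared error. Real Machine Learning explores vastly larger families of mappings, usually with gradient-based optimisation rather than the brute-force search used here.

```python
# Hypothetical observed pairs (x_i, y_i); made-up numbers where y is roughly 2 * x.
DATA = [(0.0, 0.1), (1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.9)]

# Ingredient 2: the candidate mappings y = a * x, for a = 0.00, 0.01, ..., 5.00.
CANDIDATE_SLOPES = [i / 100 for i in range(0, 501)]


def mean_squared_error(a, data):
    """Ingredient 3: assess a mapping by its average squared prediction error."""
    return sum((a * x - y) ** 2 for x, y in data) / len(data)


def select_mapping(data, slopes):
    """The machine's job: search every candidate and keep the best-scoring one."""
    return min(slopes, key=lambda a: mean_squared_error(a, data))


best_a = select_mapping(DATA, CANDIDATE_SLOPES)
print(best_a)  # 2.0
```

The search settles on the mapping y = 2x, which can then be used to predict y for a configuration x it has never seen.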

Reinforcement Learning: A Shallow Dive

In Scenario 1, the roles of the experienced and inexperienced player are ‘detached’. In other words, the inexperienced player takes the prediction of the experienced player (the only one who has gone through training) and acts upon it. This dual approach has a few limitations:

The inexperienced player has no way to assess whether the received predictions are accurate. Furthermore, even if accurate, these predictions might be calibrated for a task he/she is not interested in. Suppose, for example, that the game in question is Risk: what if the experienced player had only been trained on conquering the most territories, while the inexperienced player had been instructed to destroy the blue armies instead? This mismatch would likely result in a suboptimal strategy, not tailored to the inexperienced player’s needs.
At each step, the inexperienced player would most likely be fed only a few summarising features, which reflect the experienced player’s prediction. To be fair, this makes sense in many circumstances: if the game were simply ‘decide what to wear the next day’, you certainly wouldn’t want to be handed all the meteorological data behind a televised weather forecast, merely a few specific features, e.g., the likelihood of rainfall and the temperature range. In other cases, however, this can prevent the inexperienced player from taking advantage of a full and thorough analysis of the state of the game (should he/she need one).
Any constraint the experienced player has not been trained on is ignored. For instance, what if taking certain actions involved paying some kind of fee which the experienced player was not aware of? This would ultimately reduce the inexperienced player's final reward.

To address the above issues, a further refinement of Machine Learning, called ‘Reinforcement Learning’, has been proposed. In a nutshell, this approach ‘merges’ the roles of the experienced and inexperienced players. The resulting player, who needs training on the game, explores the effect of all the actions that can be taken and, crucially, judges them according to the very criteria he/she is ultimately interested in, while also taking all relevant constraints into account. In other words, training simultaneously improves the state predictions and the choice of actions, guided by a reward function that faithfully summarises the player’s gain. This is roughly the situation we described in Scenario 2.
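The difference between the two set-ups boils down to which objective the actions are judged by. The deliberately tiny Python sketch below (made-up moves and numbers, and no learning at all) revisits the Risk example: a detached advisor ranks moves by territories conquered, while the player's own reward counts destroyed blue armies minus a hypothetical fee that the advisor knows nothing about.

```python
# Hypothetical candidate moves in a Risk-like turn; all numbers are made up.
MOVES = {
    "expand_east": {"territories": 4, "blue_armies": 0, "fee": 0.0},
    "attack_blue": {"territories": 1, "blue_armies": 3, "fee": 1.0},
    "fortify": {"territories": 0, "blue_armies": 0, "fee": 0.0},
}


def advisor_score(move):
    """What the detached experienced player was trained on: territories only."""
    return MOVES[move]["territories"]


def player_reward(move):
    """The player's real objective, net of the fee the advisor ignores."""
    return MOVES[move]["blue_armies"] - MOVES[move]["fee"]


advisor_pick = max(MOVES, key=advisor_score)
own_pick = max(MOVES, key=player_reward)
print(advisor_pick, player_reward(advisor_pick))  # expand_east 0.0
print(own_pick, player_reward(own_pick))          # attack_blue 2.0
```

Following the advisor earns the player nothing under his/her own reward; Reinforcement Learning avoids this by training directly against `player_reward`.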

Wrap Up of Terminology

We are now acquainted with the basics of Reinforcement Learning and have already come across some of its key terminology, which we'll recap here:

State: the information that can be inferred from the current environment configuration of the game.
Action: one of the possible moves that can be performed at each step of the game (i.e., one of the possible interactions with the current environment configuration of the game).
Reward: gain resulting from taking a certain action in a certain game state.
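The three terms fit naturally into a small data structure. The Python sketch below records each step of a game as a (state, action, reward) triple and adds up the rewards of a whole game, since the Basic Example judges the player by the reward accumulated by the end. The names `Transition` and `episode_return` are hypothetical, and the optional discount factor `gamma` is a common convention in the Reinforcement Learning literature rather than something introduced in this piece.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass(frozen=True)
class Transition:
    state: Any     # what can be inferred from the current configuration
    action: str    # the move chosen in that state
    reward: float  # gain from taking that action in that state


def episode_return(trajectory: List[Transition], gamma: float = 1.0) -> float:
    """Sum the rewards of one game; gamma < 1 discounts later rewards."""
    total, weight = 0.0, 1.0
    for transition in trajectory:
        total += weight * transition.reward
        weight *= gamma
    return total


# A made-up three-step game: two rewarded moves, then a move worth nothing.
trajectory = [
    Transition(state="start", action="left", reward=1.0),
    Transition(state="B", action="right", reward=1.0),
    Transition(state="C", action="left", reward=0.0),
]
print(episode_return(trajectory))             # 2.0
print(episode_return(trajectory, gamma=0.5))  # 1.5
```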

Reinforcement Learning: Pick Your Favorite Style

We have so far described Reinforcement Learning as a procedure that simultaneously improves the state predictions and the choice of actions, based on a reward function tailored to the player’s needs. Within this general paradigm, there is more classification to be done: Reinforcement Learning methods are further split into three different cases, which we can explain using the game described below.

In this game, the player’s goal is to maximize the time he/she wanders around in a maze prior to hitting the first dead end. The player thus has to decide what to do at each crossroad in the maze and may only rely on two tools:

A map of all the maze’s crossroads, showing the reward that can be gained from taking any given direction (action) at each crossroad. This information may be incomplete (e.g., the map is blank at the beginning of the game, since the player has not yet acquired any information).
His/her sense of smell, which directly tells the player the right action to take at each crossroad (without evaluating all possible actions). Just like the map, this sense of smell also needs some fine-tuning.

Reinforcement Learning techniques are roughly split into three main categories:

Critic-based approaches: the player uses nothing but the map, and updates its (initially blank) content at every twist and turn. The effects of each allowed action are evaluated at each step.
Actor-based approaches: the player uses nothing but his/her sense of smell, i.e., learns to directly map states to actions without vetting them all.
Actor-critic approaches: the player uses both tools. In this way, both tools (the ‘intuitive/gut-feeling’ sense of smell and the ‘rational’ use of the map) are mutually beneficial in the same way that two people with different characters learn to complement each other and, working as a pair, enhance their individual performances.
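A critic-based player can be sketched with tabular Q-learning, one well-known critic-style method (chosen here for illustration; this piece does not prescribe a specific algorithm). The ‘map’ is a table of estimated values for every (crossroad, direction) pair, initially all zero, and it is updated after every turn. The maze layout, the reward of +1 for each crossroad reached, and the hyperparameters `alpha`, `gamma` and `epsilon` are all assumptions made for this sketch.

```python
import random

# Hypothetical maze: each crossroad maps a direction to the next location.
# "DEAD" marks a dead end, which ends the game.
MAZE = {
    "A": {"left": "B", "right": "E"},
    "B": {"left": "DEAD", "right": "C"},
    "C": {"left": "D", "right": "DEAD"},
    "D": {"left": "DEAD", "right": "DEAD"},
    "E": {"left": "DEAD", "right": "DEAD"},
}
DEAD_END = "DEAD"


def step(state, action):
    """Return (next_state, reward, done): +1 for every further crossroad reached."""
    nxt = MAZE[state][action]
    if nxt == DEAD_END:
        return nxt, 0.0, True
    return nxt, 1.0, False


def greedy(q, state):
    """Read the map: pick the direction with the highest estimated value."""
    return max(MAZE[state], key=lambda a: q[(state, a)])


def q_learning(episodes=500, alpha=0.5, gamma=1.0, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    # The initially blank map: every (crossroad, direction) value starts at 0.
    q = {(s, a): 0.0 for s in MAZE for a in MAZE[s]}
    for _ in range(episodes):
        state, done = "A", False
        while not done:
            # Mostly trust the map, but occasionally explore a random direction.
            if rng.random() < epsilon:
                action = rng.choice(list(MAZE[state]))
            else:
                action = greedy(q, state)
            nxt, reward, done = step(state, action)
            # Value of the best continuation (zero once a dead end is hit).
            future = 0.0 if done else max(q[(nxt, a)] for a in MAZE[nxt])
            q[(state, action)] += alpha * (reward + gamma * future - q[(state, action)])
            state = nxt
    return q


q = q_learning()
print([greedy(q, s) for s in "ABC"])  # ['left', 'right', 'left']
```

Because this toy maze has no loops, every game ends at a dead end, so no discounting is needed (`gamma=1.0`). After training, reading the map greedily from A yields the longest walk, A → B → C → D, worth 3 crossroads before the dead end. An actor-based player would instead keep direction preferences for each crossroad and adjust them directly from the rewards, without estimating the value of every option.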

Overall, Reinforcement Learning may be summed up like this:

Reinforcement Learning is a sub-field of Machine Learning. In addition to the basic requirements described in the Machine Learning section above, the machine needs to be trained to describe a dynamic interaction with a given environment. Specifically, the machine’s internal algorithms need, on the one hand, to be phrased in terms of consecutive, interpretable actions. On the other hand, they need to be driven by a reward mechanism specified by the user.

Why We Care About Reinforcement Learning Here at Trality

As promised, we have made our ‘Basic Example’ concrete several times throughout our discussion in order to illustrate a few core ideas.

Playing Risk and finding our way through an imaginary maze definitely sound like entertaining activities worth taking up once in a while. However, as we hinted at the beginning of this piece, we here at Trality are ultimately interested in playing a different kind of game (in keeping with the language of this piece).

At its core, Trality has come up with a visionary platform that brings several state-of-the-art tools for the creation, design, and consolidation of automated trading bots (i.e., predefined strategies which are deployed for buying and selling stocks in the financial markets) to a constantly growing user base. More specifically, Trality is on track to democratise trading and make it accessible to everyone (not just to those who are in the business), and it is achieving this goal by letting the user (or 'bot Creator') strike whichever balance between out-of-the-box features and more advanced coding features best suits him/her.

Ultimately, any given trading bot created by our users will constantly be confronted with the current 'state' of the financial environment (e.g., stock prices, non-price financial indicators, sentiment data; you name it). In response, the bot will need to perform some 'actions' (such as buying/selling stocks, or re-shaping users' portfolios). As you may have guessed, 'state' and 'actions' reflect the fact that this financial scenario can indeed be framed within the language of Reinforcement Learning, which we introduced above.
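To illustrate this framing, here is a hypothetical toy environment in Python; it is not Trality's bot API. The state is the last few prices plus the current position, the actions are ‘buy’, ‘sell’ and ‘hold’, and the reward is the price change earned while holding, minus a proportional trading fee. The price series, the fee rate and the class name `TradingEnv` are assumptions for illustration only.

```python
class TradingEnv:
    """Toy single-asset market framed as an RL environment (hypothetical)."""

    def __init__(self, prices, window=3, fee=0.001):
        self.prices = list(prices)  # made-up price series, not market data
        self.window = window        # how many recent prices the state exposes
        self.fee = fee              # hypothetical proportional fee per trade

    def reset(self):
        self.t = self.window - 1    # first step with a full price window
        self.position = 0           # 0 = holding cash, 1 = holding one unit
        return self._state()

    def _state(self):
        recent = tuple(self.prices[self.t - self.window + 1 : self.t + 1])
        return recent, self.position

    def step(self, action):
        """Apply 'buy', 'sell' or 'hold'; return (state, reward, done)."""
        cost = 0.0
        if action == "buy" and self.position == 0:
            self.position, cost = 1, self.fee * self.prices[self.t]
        elif action == "sell" and self.position == 1:
            self.position, cost = 0, self.fee * self.prices[self.t]
        self.t += 1
        # Reward: price change earned by the position over this step, net of fees.
        pnl = self.position * (self.prices[self.t] - self.prices[self.t - 1])
        done = self.t == len(self.prices) - 1
        return self._state(), pnl - cost, done


env = TradingEnv([100, 101, 102, 104, 103], window=2, fee=0.01)
state = env.reset()
rewards = []
for action in ["buy", "hold", "sell"]:  # a fixed, hand-picked action sequence
    state, reward, done = env.step(action)
    rewards.append(reward)
print(round(sum(rewards), 2))  # 0.95
```

An RL agent would replace the hand-picked action sequence with a learnt policy; the fee term is exactly the kind of constraint that a detached predictor can easily overlook.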

As we strive to offer a better and more inclusive user experience, we strongly believe that an ever-increasing degree of interpretability is a key feature that all our products will have. The intuitiveness and interpretability provided by the Reinforcement Learning paradigm make it relevant and integral for a leading fintech platform such as Trality. It is thus our intention to incorporate elements of Reinforcement Learning in our future products and to use it to craft even better tools for our bot Creators.

Looking to create your own trading algorithm?

Check out the Trality Code Editor. Our world-beating Code Editor is the world’s first browser-based Python Code Editor, which comes with a state-of-the-art Python API, numerous packages, a debugger and end-to-end encryption. We offer the highest levels of flexibility and sophistication available in private trading. In fact, it’s the core of what we do at Trality.

Alright, I get the gist of it now. But what does Reinforcement Learning look like in terms of code and what’s the mathematical foundation that makes it possible? Tune in for the second episode in this blog series on Reinforcement Learning and we’ll dig deeper.