# Capstone Project: Investment and Trading

**Project Overview:**

This project explores the idea of using machine learning to predict stock trends over different time frames. It combines methods to download and explore data for different tickers in order to get an idea of how well a stock price will perform in the future, and it uses multiple ML models to make the predictions.

The program consists of multiple Python classes and a Jupyter notebook, which is divided into three main sections:

- Prepare Data (download, clean, and calculate features)
- Analyze Datasets (provides statistical figures and visualizations to explore the dataset)
- Forecasting (performs predictions via ML models)

The notebook provides widgets that enable the user to set a list of ticker symbols, statistical values, a date range for the historical data load, and the time frames for which predictions should be made.

During the implementation of this program, I intended to use the yfinance API, which provided daily historical data for a wide variety of tickers. Unfortunately, the API has been facing issues since July 2, 2021, so I decided to download CSV files for the tickers Nvidia and Daimler AG, ranging from February 22, 2010 to July 5, 2021.

Once the API is back online, the notebook should automatically make use of it again.

As a standard setting, the program provides forecasts for 7, 14 and 30 days.

Based on the given architecture, the classes can be used to build a web or mobile app later.

**Data Preprocessing:**

To fetch data, the program uses the yfinance API, which queries stock information from Yahoo. To start, the user enters ticker symbols into the form field and specifies the time range for the historical data load.

The API returns a data frame, which consists of the date as the index, the open price, the minimum (low) price, the maximum (high) price, the close price, and the volume of traded stocks.

Additionally, the user can set up multiple forecast windows and adjust some statistical rolling indicators, which are part of the feature columns the ML models learn from.

During data processing, the program provides seven different indicators in total:

- Rolling standard deviation
- Simple moving average (two windows)
- Upper and lower Bollinger bands
- Daily returns
- Cumulative returns
- Relative Strength Index (RSI)
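These indicators could be computed with pandas rolling windows roughly as follows. This is only a sketch: the column name `Close`, the single window size, and the two-standard-deviation Bollinger bands are my assumptions, not values taken from the project code.

```python
import pandas as pd

def add_indicators(df, window=20):
    """Add the rolling indicators listed above to a price DataFrame.

    Assumes a 'Close' column; the window size is an illustrative default.
    """
    out = df.copy()
    out["sma"] = out["Close"].rolling(window).mean()
    out["rolling_std"] = out["Close"].rolling(window).std()
    # Bollinger bands: SMA plus/minus two rolling standard deviations
    out["bb_upper"] = out["sma"] + 2 * out["rolling_std"]
    out["bb_lower"] = out["sma"] - 2 * out["rolling_std"]
    out["daily_return"] = out["Close"].pct_change()
    out["cum_return"] = (1 + out["daily_return"]).cumprod() - 1
    # RSI: average gains relative to average losses over the window
    delta = out["Close"].diff()
    gain = delta.clip(lower=0).rolling(window).mean()
    loss = (-delta.clip(upper=0)).rolling(window).mean()
    out["rsi"] = 100 - 100 / (1 + gain / loss)
    return out
```

The second moving average would be added the same way, just with a different window size.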

However, before calculating the features above, I perform some data-cleansing tasks to produce a dataset that is better suited for analysis and visualization.

Usually, the API only provides data for trading days. That means a visualization of the stock trend would contain gaps. Because of that, I take the first and the last date index of the downloaded dataset and create a new data frame covering the entire date range, so that weekends and public holidays are also included. Next, I join the given dataset with the new data frame and use the forward-fill method to obtain continuous data.
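In pandas, this reindex-and-forward-fill step could look like the following sketch (the daily frequency and a date index on the input frame are assumptions):

```python
import pandas as pd

def fill_calendar_gaps(df):
    """Reindex a trading-day DataFrame to the full calendar range and
    forward-fill weekends and public holidays, as described above."""
    full_range = pd.date_range(start=df.index.min(), end=df.index.max(), freq="D")
    return df.reindex(full_range).ffill()
```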

**Data Exploration & Visualization:**

To gain insight into the given datasets, the section "Analysis Datasets" provides global statistics such as the mean daily return, the cumulative return (which describes the total return of a stock since the beginning of the record), the standard deviation, and the Sharpe ratio.

A lower standard deviation indicates that the stock carries less risk of high variability; in other words, it reflects the volatility of a stock.

The Sharpe ratio indicates whether a stock can generate a higher return compared to a risk-free investment, such as putting money into a bank account with some interest. In conclusion: the higher the Sharpe ratio, the better the returns over time.
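A common way to compute an annualized Sharpe ratio from daily returns looks roughly like this. The zero risk-free rate and the 252 trading days per year are illustrative assumptions, not values from the notebook:

```python
import numpy as np

def sharpe_ratio(daily_returns, risk_free_rate=0.0, periods=252):
    """Annualized Sharpe ratio: mean excess return divided by its
    standard deviation, scaled to one year of trading days."""
    excess = np.asarray(daily_returns) - risk_free_rate
    return np.sqrt(periods) * excess.mean() / excess.std()
```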

For each stock, I provide global statistics for a one-year period and for the entire dataset. The one-year observation is interesting for seeing how a stock has performed lately. The full-period numbers are more important for the prediction, which I will explain later in this text.

The program also provides a number of visualizations for the data analysis part. The first figure shows the trend of the current stock combined with additional indicators.

I used Plotly to be able to drill into the figure. A closer look shows how the actual price moves between the Bollinger Bands. One trading strategy says: buy if the stock price falls below the lower Bollinger band, and sell if it rises above the upper Bollinger band. *(An automatic notification would actually be a nice extension for this program.)*

However, we can see that the stock trend appears to be volatile over the years. For a closer look, I created a histogram of the distribution of the daily returns.

Here, we can see that the stock mostly shows a slight increase. However, there is also a high number of ups and downs, which explains the volatility of the stock price. So why is this of interest?

My assumption here: the more volatile a stock, the harder it is to predict.

I also want to show the stock trend of Nvidia. The figure shows that the dataset looks linear for a long period, but at the end there is a steep increase. Since the training dataset mostly covers the low-volatility part, it will be interesting to see how the algorithms perform with this data.

The last figure I provide in the analysis section is a correlation matrix. I use this plot to check whether the features I selected have an influence on the stock price or not.

We can see that most features have a very strong positive correlation with the price. Only the RSI and the daily return seem to have no influence at all. I could observe the same with the Nvidia stock.

However, for this project I decided to keep both features in the dataset; for the next development iteration, a feature selection based on the correlation values would be an important addition.

**Further data preparation:**

Before I can start with the prediction of the stock price, the dataset needs to be further prepared.

Currently, there is no y_train/y_test target that a supervised model can be trained on.

In order to create one, I take the price column and shift it by the number of days I want to forecast.

For instance, if I want to predict the price in five days, I make a copy of the price column and shift it by five, as shown in the figure below. Afterwards, I adjust the index column; otherwise, the date index would be misleading.
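The shift step can be sketched with pandas like this (the column name `Close` and the target column name are my assumptions; rows without a known future price are simply dropped):

```python
import pandas as pd

def make_target(df, days_out):
    """Create the supervised-learning target by shifting the price column
    `days_out` days into the future, so each row is labeled with the
    price observed `days_out` days later."""
    out = df.copy()
    out["target"] = out["Close"].shift(-days_out)
    # the last `days_out` rows have no future price yet
    return out.dropna(subset=["target"])
```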

Now I am able to perform the split into training and test datasets. In this project, the ratio between training and test data is 75% to 25%.

It is also important to understand that for each forecast window, the dataset must be adapted first.

# Metrics:

To benchmark the results of every ML algorithm, it is necessary to define some statistical criteria. For this project, I chose four performance indicators, which I want to describe in more detail in this section.

**Mean square error (MSE):**

The first indicator I want to talk about is the mean square error (MSE). This value describes the variance of the predicted values. For example, imagine a two-dimensional plane with two dots: the first dot is the expected value and the second dot is the predicted value. The MSE describes how closely the predicted dot ranges around the expected dot. The smaller the value, the better the predictions.

**Root mean square error (RMSE):** The second indicator I take into consideration is the root mean square error (RMSE). As the name says, the RMSE is the square root of the MSE. It is effectively the standard deviation of the differences between predicted and actual values. Since we want to know the distance between the predicted and the actual stock price, the RMSE expresses it in the same unit as the price. As with the MSE, the smaller the better.

**Mean absolute error (MAE):**

The third indicator is the mean absolute error (MAE). It describes the accuracy of a prediction. To use this metric, the values being compared must have the same dimension, which is the case in my project. Because of that, I will also consider this value.

**Coefficient of determination (R2):** The coefficient of determination (R2), or r-squared, is a statistical metric that describes the distance between the predicted values and the regression line in percent: 0% means the distance from the regression line is huge and the quality of the model is poor, while 100% indicates that the predicted values are close to the regression line and the model is good. It is one of the most important metrics for judging whether an ML model performs well.

Since all of these metrics are useful for describing the results of my ML models, I will consider all four of them.
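All four metrics are available in scikit-learn, so computing them for one model's predictions could look like this small sketch (the helper name `score_model` is mine, not from the notebook):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

def score_model(y_true, y_pred):
    """Return MSE, RMSE, MAE, and R2 for one set of predictions."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        "MSE": mse,
        "RMSE": np.sqrt(mse),  # same unit as the price itself
        "MAE": mean_absolute_error(y_true, y_pred),
        "R2": r2_score(y_true, y_pred),
    }
```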

**Algorithm Techniques and Evaluation:**

For the experiment, I decided to use three different machine-learning (ML) algorithms, all of which belong to the class of supervised learning algorithms:

- Linear Regression
- Multi-layer Perceptron
- LSTM

The idea is to create a model with each ML technique and compare the results by the numbers of the:

- coefficient of determination (R2)
- mean square error (MSE)
- root mean square error (RMSE)
- mean absolute error (MAE)

In the notebook, I also added an accuracy ratio, but since there are a lot of rounding errors, the expressiveness of this number is low.

**Linear Regression:** The linear regression model is an approach that tries to explain an observed variable through the use of different independent variables. In other words, it tries to identify the relationships between the target and the explanatory variables, whereby the target is a linear combination of the regression coefficients [1]. As input, I take the entire feature list.
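Fitting such a model with scikit-learn takes only a few lines; this is a sketch with a hypothetical wrapper function, not the project's exact code:

```python
from sklearn.linear_model import LinearRegression

def fit_linear_regression(x_train, y_train):
    """Fit an ordinary least squares model on the full feature list."""
    return LinearRegression().fit(x_train, y_train)
```

Predictions for the test split are then produced with the model's `predict` method.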

**Multi-layer Perceptron:** The multi-layer perceptron (MLP) belongs to the class of artificial neural networks (ANNs) and consists of at least three layers of nodes (input layer, hidden layer, and output layer). All nodes except those of the input layer are neurons that use a nonlinear activation function [2].

*[1] Source: Lineare Regression, https://de.wikipedia.org/wiki/Lineare_Regression*
*[2] Source: Multilayer perceptron, https://en.wikipedia.org/wiki/Multilayer_perceptron*

Since neural networks are more effective with normalized values, I normalize the entire dataset to the range between 0 and 1.
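With scikit-learn, this normalization could be done with `MinMaxScaler`, roughly as sketched below. Fitting the scaler on the training split only is my own precaution here, so that no information from the test set leaks into training:

```python
from sklearn.preprocessing import MinMaxScaler

def scale_features(x_train, x_test):
    """Scale all features to [0, 1], fitting on the training split only."""
    scaler = MinMaxScaler(feature_range=(0, 1))
    x_train_scaled = scaler.fit_transform(x_train)
    x_test_scaled = scaler.transform(x_test)
    return x_train_scaled, x_test_scaled, scaler
```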

For the MLP model, I set the maximum number of iterations to 1000 and the size of the single hidden layer to 100 neurons.

```python
from sklearn.neural_network import MLPRegressor

max_iter = 1000
hls = 100
MLP = MLPRegressor(random_state=0, max_iter=max_iter,
                   hidden_layer_sizes=(hls,),
                   activation='identity',
                   learning_rate='adaptive').fit(x_train_scaled, y_train)
```

By setting the hidden-layer size to 100, I could obtain very good results, but it was also necessary to increase the maximum number of iterations to 1000; otherwise the algorithm ran into an error because it was not able to finish training on the entire dataset.

**LSTM:**

The term LSTM is the abbreviation for "long short-term memory", and it belongs to the class of artificial neural networks as well. The main difference from other neural networks is the memory function, implemented through three types of gates: the input gate, the forget gate, and the output gate [3].

As with the MLP model, I also perform a normalization on the dataset first. Next, I set up a model with three LSTM layers of 100 units each and one output layer.

The batch size is set to 4 and the number of epochs (iterations) is set to 6.

```python
from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout

model = Sequential()
# add first layer
model.add(LSTM(units=100, return_sequences=True, input_shape=(x_train_data.shape[1], 1)))
model.add(Dropout(0.2))
# add second layer
model.add(LSTM(units=100, return_sequences=True))
model.add(Dropout(0.2))
# add third layer
model.add(LSTM(units=100, return_sequences=False))
model.add(Dropout(0.2))
# output layer (one unit for the single target column) and training
model.add(Dense(units=1))
model.compile(optimizer='adam', loss='mean_squared_error')
model.fit(x_train_data, y_train_data, batch_size=4, epochs=6, verbose=0)
```

I chose the settings for MLP and LSTM based on the results of several test runs; these settings led to the best performance so far.

[3] Source: LSTM, https://de.wikipedia.org/wiki/Long_short-term_memory

## Refinement:

In the section above, I already provided the final parameter settings for each ML model. However, in this section I want to show how difficult it is to find the right parameters with respect to performance. As an example I take the LSTM, since with this algorithm the parameters seem to have the biggest influence.

As described above, I created the model with three LSTM layers and one output layer. Since the output layer must have exactly one unit (there is one target column), the parameters that can be varied are the number of units per LSTM layer, the batch size, and the number of epochs.

#units=50

#batch=20

#epochs=4

LSTM R2: 0.8343208888037233

LSTM MSE: 24.18871589943589

LSTM RMSE: 4.918202506956773

LSTM MAE: 3.6720203234698316

Test loss: 0.004514301661401987

Test accuracy: 0.004514301661401987

Accuracy: 0.45%

#units=50

#batch=10

#epochs=2

LSTM R2: 0.8529816884815744

LSTM MSE: 21.464288066592907

LSTM RMSE: 4.632956730490034

LSTM MAE: 3.1555438136686225

Test loss: 0.004005847033113241

Test accuracy: 0.004005847033113241

Accuracy: 0.40%

#units=50

#batch=1

#epochs=4

LSTM R2: 0.9393601858190183

LSTM MSE: 8.853253900430836

LSTM RMSE: 2.975441799200723

LSTM MAE: 2.27944727267998

Test loss: 0.0016522685764357448

Test accuracy: 0.0016522685764357448

Accuracy: 0.17%

#units=100

#batch=8

#epochs=8

LSTM R2: 0.9373253905196233

LSTM MSE: 9.15032867983537

LSTM RMSE: 3.0249510210638735

LSTM MAE: 2.350033736063246

Test loss: 0.0017077106749638915

Test accuracy: 0.0017077106749638915

Accuracy: 0.17%

#units=100

#batch=2

#epochs=4

LSTM R2: 0.9441181986701426

LSTM MSE: 8.158596497510764

LSTM RMSE: 2.856325698779949

LSTM MAE: 2.2533935607262094

Test loss: 0.0015226254472509027

Test accuracy: 0.0015226254472509027

Accuracy: 0.15%

#units=100

#batch=4

#epochs=6

LSTM R2: 0.9518972432497814

LSTM MSE: 7.02287638199774

LSTM RMSE: 2.650071014519751

LSTM MAE: 2.052696560230034

Test loss: 0.001310668420046568

Test accuracy: 0.001310668420046568

Accuracy: 0.13%

#units=100

#batch=4

#epochs=8

LSTM R2: 0.9573160407850155

LSTM MSE: 6.231746147474334

LSTM RMSE: 2.4963465599700565

LSTM MAE: 1.9181539341105456

Test loss: 0.0011630207300186157

Test accuracy: 0.0011630207300186157

Accuracy: 0.12%

#units=100

#batch=4

#epochs=10

LSTM R2: 0.9551101251093278

LSTM MSE: 6.553804053217859

LSTM RMSE: 2.5600398538338927

LSTM MAE: 1.987840666826167

Test loss: 0.0012231270084157586

Test accuracy: 0.0012231270084157586

Accuracy: 0.12%

As can be seen from the result list, by increasing the number of units and decreasing the batch size I was able to raise the R2 value from 0.834 to 0.944. But decreasing the batch size, which is the number of tuples processed per training step, results in a much longer training time. The same applies to the epochs parameter, which indicates how often the training is repeated.

Because of that, I decided to take units=100, batch=4, and epochs=6 as my final setting, since this gave the best trade-off between quality and time.

**Benchmark & Results:**

Now let's take a look at the results after running the last section of the notebook. For the first observation, we look at the Daimler stock with a seven-day forecast.

7 days out:

------------

Linear Regression

Linear Regression R2: 0.9599340051810165

Linear Regression MSE: 5.849530208769271

Linear Regression RMSE: 2.418580205155345

Linear Regression MAE: 1.8381742201585387

Accuracy: 0.026061776061776062

With the linear regression model (LR) and the MLP model, the coefficient of determination (R2) is at 0.96. The mean square error is at 5.8, the RMSE at 2.4, and the mean absolute error at 1.8.

`Muli-layer Perceptron`

Muli-layer Perceptron R2: 0.9597540943486109

Muli-layer Perceptron MSE: 5.875796718656164

Muli-layer Perceptron RMSE: 2.4240042736464313

Muli-layer Perceptron MAE: 1.842234816874342

Accuracy: 0.019305019305019305

With LSTM, the R2 value is at 0.95, the MSE at 7.02, the RMSE at 2.65, and the MAE at 2.05, which are all pretty good values.

`LSTM`

LSTM R2: 0.9518996750746772

LSTM MSE: 7.022521341938096

LSTM RMSE: 2.6500040267777134

LSTM MAE: 2.050466546018151

Test loss: 0.0013106020633131266

Test accuracy: 0.0013106020633131266

Accuracy: 0.13%

A first look at the trend chart confirms the numbers: LR and MLP are close to the actual trend, whereas LSTM shows more variance.

Drilling into the chart, the figure shows that LR follows the stock trend but with a right shift along the date axis. The same can be observed with MLP.

With LSTM, there is also a shift on the x-axis, but apart from that the predictions correctly follow the trend of the actual price.

Let's also take a look at the correlation scatter plots of LR and LSTM. The predicted/actual dots of LR are closer to the line; with LSTM, the arrangement of the dots looks nearly identical.

However, with both ML algorithms, some dots are scattered at the tails of the line, which indicates a higher variance caused by the volatility of the stock.

Let's take a look at the following results with a higher number of days to predict.

As expected, with all ML algorithms the accuracy drops as the number of days increases.

14 days out:

------------

Linear Regression

Linear Regression R2: 0.9210534027157851

Linear Regression MSE: 11.205277229153344

Linear Regression RMSE: 3.34742845019178

Linear Regression MAE: 2.480697728061077

Accuracy: 0.02131782945736434

Muli-layer Perceptron

Muli-layer Perceptron R2: 0.9208279486779142

Muli-layer Perceptron MSE: 11.237277025011299

Muli-layer Perceptron RMSE: 3.3522048005769722

Muli-layer Perceptron MAE: 2.4830014230203714

Accuracy: 0.025193798449612403

LSTM

LSTM R2: 0.8293106867363559

LSTM MSE: 24.22677025948721

LSTM RMSE: 4.9220697129853015

LSTM MAE: 3.9718806714789814

Test loss: 0.004521401599049568

Test accuracy: 0.004521401599049568

Accuracy: 0.45%

However, even with a horizon of 30 days, the R2 of LR and MLP is still at 0.8, while LSTM drops to 0.79; the error rates of all three metrics, however, increase roughly quadratically.

30 days out:

------------

Linear Regression

Linear Regression R2: 0.8023495414619239

Linear Regression MSE: 26.06617012652934

Linear Regression RMSE: 5.105503905250621

Linear Regression MAE: 3.840367674747195

Accuracy: 0.014634146341463415

Muli-layer Perceptron

Muli-layer Perceptron R2: 0.8056262562367411

Muli-layer Perceptron MSE: 25.634036523560532

Muli-layer Perceptron RMSE: 5.06300666833064

Muli-layer Perceptron MAE: 3.8282281762928236

Accuracy: 0.007804878048780488

LSTM

LSTM R2: 0.794942949055266

LSTM MSE: 27.042952569422663

LSTM RMSE: 5.200283893156475

LSTM MAE: 3.942713516533084

Test loss: 0.005046984646469355

Test accuracy: 0.005046984646469355

Accuracy: 0.50%

Now I want to take a quick look at the results of the Nvidia prediction. From our previous observation of the trend data, we know that Nvidia is not as volatile as the Daimler stock.

7 days out:

------------

Linear Regression

Linear Regression R2: 0.9840980476812651

Linear Regression MSE: 477.8649860102321

Linear Regression RMSE: 21.860123192933568

Linear Regression MAE: 15.751104514724204

Accuracy: 0.0019305019305019305

Muli-layer Perceptron

Muli-layer Perceptron R2: 0.982952112618063

Muli-layer Perceptron MSE: 512.3011503232476

Muli-layer Perceptron RMSE: 22.634070564599014

Muli-layer Perceptron MAE: 16.427553295187025

Accuracy: 0.0019305019305019305

#units=100

#batch=100

#epochs=10

LSTM

LSTM R2: 0.9526400502028002

LSTM MSE: 1423.200201689741

LSTM RMSE: 37.72532573338156

LSTM MAE: 28.309211559516577

Test loss: 0.0021751392632722855

Test accuracy: 0.0021751392632722855

Accuracy: 0.22%

#units=100

#batch=4

#epochs=6

LSTM R2: 0.7715236662166166

LSTM MSE: 6865.876457096025

LSTM RMSE: 82.86058446026087

LSTM MAE: 59.134434449773956

Test loss: 0.010493420995771885

Test accuracy: 0.010493420995771885

Accuracy: 1.05%

It seems that the lower volatility is also reflected in the R2 values, which are 0.98 for LR and MLP. Even with LSTM, the R2 is at 0.95 for a seven-day prediction. With the forecast extended to 30 days, there is a decrease of ~0.06 for all ML models. The biggest difference compared to the first observation with Daimler is that in all results, the MSE, RMSE, and MAE are much higher.

With LSTM, I experimented a bit more. To my astonishment, LSTM seems to perform differently with the given parameters depending on the volatility of a stock. As we can see from the seven-day and the 30-day observations, the results with a batch size of 100 and 10 epochs are much better than with a batch size of 4 and 6 epochs.

30 days out:

------------

Linear Regression

Linear Regression R2: 0.9260988131636365

Linear Regression MSE: 1967.4077523867877

Linear Regression RMSE: 44.355470377246455

Linear Regression MAE: 32.34013531018811

Accuracy: 0.002926829268292683

Muli-layer Perceptron

Muli-layer Perceptron R2: 0.9314364892804224

Muli-layer Perceptron MSE: 1825.3073907897913

Muli-layer Perceptron RMSE: 42.72361631217319

Muli-layer Perceptron MAE: 31.672436833157796

Accuracy: 0.001951219512195122

#units=100

#batch=100

#epochs=10

LSTM

LSTM R2: 0.8819063513993067

LSTM MSE: 3143.9056625586086

LSTM RMSE: 56.070541842919695

LSTM MAE: 41.608872088176454

Test loss: 0.006525109056383371

Test accuracy: 0.006525109056383371

Accuracy: 0.65%

#units=100

#batch=4

#epochs=6

LSTM R2: 0.6263458145324079

LSTM MSE: 9947.47408899509

LSTM RMSE: 99.73702466484094

LSTM MAE: 72.85733414812786

Test loss: 0.020645776763558388

Test accuracy: 0.020645776763558388

Accuracy: 2.06%

Looking at the next two charts, we can see a steep increase of the stock price within the test dataset.

This development could explain why the error rates have increased so much. But for now, it doesn't really explain why LSTM performs so differently depending on the parameter settings.

# Justification:

All three models seem to perform very effectively when it comes to stock prediction. With the given Daimler stock, all algorithms reach a very good R2 of ~95% and an average RMSE of ~2.5. Because of that, and considering that I only used technical indicators as features, the setup of the algorithms seems quite good.

The circumstance that LSTM suddenly performs differently when the dataset changes also indicates that the model settings depend on the given dataset. However, we cannot foresee how a stock will suddenly behave (for example, the rise of the Nvidia stock). Because of that, and because of the difficulty of setting up a well-performing LSTM, I would rather tend to use LR or MLP, given their robustness on average.

An additional important factor is time. Training and prediction with LSTM took at least four times longer than with LR or MLP. With respect to the results, it is hard to justify why LSTM would be the better algorithm.

**Reflection:**

With this project, I have conducted predictions for two different stocks using three different ML models over several time windows. During the observation of the results, LR and MLP always seemed to perform a bit better than LSTM. If I also take the time for training and prediction into account, I would tend to use LR or MLP.

By using two different datasets, I can also conclude that the results depend on the volatility of the stock trend. When it comes to LSTM, the volatility of a stock also seems to matter for the setup of the model parameters.

In order to predict a stock price, I have only used the price trend of past days and some statistical indicators. Given that, all results look good.

However, if we take a very close look at the predicted data for a seven-day forecast, we can see a trend shift of exactly seven days between the actual and the predicted data. I explain this behavior by the dataset the algorithms are trained with: it seems that all algorithms essentially follow the actual stock price given in the dataset.

In other words, the results I got should not be used to make stock-trading decisions. For that, an additional set of features is mandatory. Also, switching from a supervised approach to a deep learning algorithm could bring better results. For now, the predictions are too unspecific.

**Improvement:**

Although the performance of each model seemed quite impressive, I have not yet faced a real prediction. Thus, a new set of features is necessary. The yfinance API could help a lot here, since it provides much more market data, metadata, and information about the companies, which can be used to define new features. With neural networks like MLP and LSTM, I see a lot of potential to improve the results by changing the model parameters.

During the creation of the LSTM model, I detected dramatic changes in performance while playing with the number of units or the batch size. Thus, a grid search or another randomized approach could help a lot here.
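Such a search over the LSTM hyperparameters could be sketched as a plain loop over all combinations. This is only a sketch: `build_and_score` is a hypothetical callback that would train the model with the given parameters and return a quality score such as R2.

```python
from itertools import product

def grid_search(build_and_score, param_grid):
    """Minimal manual grid search: try every parameter combination and
    keep the one with the highest score returned by the callback."""
    best_params, best_score = None, float("-inf")
    keys = sorted(param_grid)
    for values in product(*(param_grid[k] for k in keys)):
        params = dict(zip(keys, values))
        score = build_and_score(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

For the real LSTM, each callback invocation would rebuild and retrain the model, so the grid should stay small or the search should be randomized.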

I hope my project is a good contribution, and I would be happy to receive some feedback.

**Thanks a lot!**

*If you are interested in my code, you can get it on **GitHub**.*