Asset Price Prediction Using Principal Component Analysis and a Machine Learning Regression Model




In this post, we are trying to predict tomorrow’s price of a financial asset using a machine learning method and show how we can improve the prediction result by using a feature extraction technique such as principal component analysis (PCA).

What is feature extraction?

Feature extraction is the process of deriving the most relevant and informative features from a dataset to improve the performance of machine learning models used for financial analysis. It reduces the number of features in the model by creating new features from the existing attributes. The result is a compact set of features that contribute the most to the prediction outcomes, with noise and irrelevant information removed.

In finance, several types of feature extraction techniques are used to identify the most relevant features from a dataset. Some of the most commonly used feature extraction techniques include:

- Principal Component Analysis (PCA): A statistical technique that finds the directions along which the data vary the most and creates a new, smaller set of variables that are linear combinations of the original variables.

- Autoencoder: A neural network-based technique that learns a compressed representation of the input data by encoding it into a lower-dimensional space.

- Wavelet Transform: A mathematical technique that decomposes a signal into different frequency bands and extracts features from each band.

- Mean and Standard Deviation Computations: A statistical technique that calculates the mean and standard deviation of a dataset to extract features.

In asset price forecasting, feature extraction techniques such as principal component analysis (PCA) and autoencoder have been successfully applied to identify critical features that affect the performance of machine learning models and achieve more accurate stock price predictions.
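To make the PCA description above more concrete, here is a minimal sketch in plain NumPy. The synthetic data, the random seed, and all variable names are assumptions made for illustration; the rest of this post uses the PCA class from scikit-learn instead.

```python
import numpy as np

# Synthetic data (an assumption for illustration): 200 observations of three
# variables, two of which are driven by the same hidden factor.
rng = np.random.default_rng(0)
factor = rng.normal(size=200)
X = np.column_stack([
    factor + 0.1 * rng.normal(size=200),
    2.0 * factor + 0.1 * rng.normal(size=200),
    rng.normal(size=200),
])

# Centre each column, then take the SVD; the rows of vt are the principal axes.
X_centered = X - X.mean(axis=0)
u, s, vt = np.linalg.svd(X_centered, full_matrices=False)

# Share of the total variance captured by each component.
explained_variance_ratio = s**2 / np.sum(s**2)

# Project onto the first two axes: each new feature is a linear combination
# of the three original columns.
scores = X_centered @ vt[:2].T

print(explained_variance_ratio)
print(scores.shape)
```

Because the first two columns share one factor, most of the variance should end up in the first component, which is the sense in which PCA compresses correlated inputs.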

Python Implementation

In the following Python code, we implement a machine learning regression model called Random Forest to predict the next-day asset price.

First, we need to load all necessary packages into the Python environment.

import pandas as pd

from sklearn.model_selection import train_test_split

from sklearn.ensemble import RandomForestRegressor

from sklearn.metrics import mean_squared_error

import numpy as np

import yfinance as yf

from sklearn.decomposition import PCA

Then, we need to get the asset prices and load them into our Python environment. The following function uses the yfinance package to download asset prices from Yahoo Finance.

# Function to download historical stock data using yfinance

def Get_asset_data(ticker, start_date, end_date):

    stock_data = yf.download(ticker, start=start_date, end=end_date)

    return stock_data 

 The predict_asset_prices function performs the following steps:

1. It splits the asset data into a 70% training set and a 30% test set.

2. It fits a Random Forest regression model to the training data and saves the fitted model in the my_model variable.

3. It tests the performance of the model against unseen data. In this step, it calls the model’s predict method on the X_test feature values and saves the output in the predictions variable.

4. It calculates the mean squared error (MSE), the average squared difference between the actual target values and the predicted values. The smaller the MSE, the better the fit (although we need to make sure the model is not overfitted).

5. Finally, it prints the MSE and returns the fitted model, which will be used for forecasting the next-day asset price.
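The MSE calculation in the steps above can be worked through by hand. The sketch below uses made-up prices and a hypothetical helper name; in the post itself the value comes from scikit-learn's mean_squared_error.

```python
def mean_squared_error_manual(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical prices, chosen only to make the arithmetic easy to follow.
actual = [100.0, 102.0, 105.0]
predicted = [101.0, 101.0, 107.0]

# The errors are -1, 1 and -2, so the squared errors are 1, 1 and 4.
print(mean_squared_error_manual(actual, predicted))  # 2.0
```

Note that the MSE is expressed in squared units of the target, so its size depends on the scale of the data it was computed on.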

It’s important to note that we need to create feature and target input data to use in this function. I will explain how to do it later in this post.

# Function to fit a Random Forest model and evaluate it on held-out data

def predict_asset_prices(features, target):

    X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.3, random_state=42)

    # Use RandomForestRegressor model

    model = RandomForestRegressor(n_estimators=700, random_state=42)

    my_model = model.fit(X_train.values, y_train.values)

    # Predictions

    predictions = model.predict(X_test.values)

    # Evaluate the model Using Mean Square Error (MSE)

    mse = mean_squared_error(np.array(y_test), predictions)

    print(f'Mean Squared Error: {mse}')

    return my_model
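One design choice worth noting: train_test_split shuffles the rows by default, so with time-ordered prices the test set contains days that fall between training days. A possible alternative, not used in this post, is a chronological split that keeps the most recent rows for testing (passing shuffle=False to train_test_split has the same effect). The helper name below is hypothetical and the sketch works on plain sequences.

```python
def chronological_split(features, target, test_size=0.3):
    """Keep the earliest rows for training and the latest rows for testing."""
    if len(features) != len(target):
        raise ValueError("features and target must have the same length")
    n_test = round(len(features) * test_size)
    split_index = len(features) - n_test
    return (features[:split_index], features[split_index:],
            target[:split_index], target[split_index:])


# Toy example: ten "days" and the value that follows each of them.
days = list(range(10))
next_day = [d + 1 for d in days]

X_train, X_test, y_train, y_test = chronological_split(days, next_day)
print(X_train)  # [0, 1, 2, 3, 4, 5, 6]
print(X_test)   # [7, 8, 9]
```

With a split like this, the reported error reflects how the model does on days that come strictly after everything it was trained on.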

At this point, let’s get the data from Yahoo Finance, fit the RandomForestRegressor model to it, and see how well it predicts the asset price.

For this example, we use the Apple stock price ('AAPL') between '2017-01-01' and '2023-12-22'.

As mentioned earlier, we need to create a target series to pass into the predict_asset_prices function. Because we want to predict tomorrow’s asset price, we shift the daily close price by one day so that each row holds the next day’s close, and use this shifted series as the target input of the function.

It is important to note that this shift creates an NA target value (Y dataset) for the last day, so we need to remove the last day from the feature data (X dataset) as well. We do this using stock_data.iloc[:-1,:].
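The alignment between features and the shifted target is easy to see on a tiny example. The sketch below uses plain Python lists and made-up closing prices; it mirrors what shift(-1).dropna() and iloc[:-1,:] do on the pandas objects in this post.

```python
# Hypothetical closing prices for four consecutive trading days.
closes = [10.0, 11.0, 12.0, 13.0]

# Target: tomorrow's close for every day that has a following day.
targets = closes[1:]

# Features: drop the last day, because its next-day close is not known yet.
features = closes[:-1]

for today, tomorrow in zip(features, targets):
    print(today, "->", tomorrow)
```

Both lists end up with three rows, and each feature row is paired with the close of the following day.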

# APPLE Example usage

ticker_symbol = 'AAPL'

start_date = '2017-01-01'

end_date = '2023-12-22' 

stock_data = Get_asset_data(ticker_symbol, start_date, end_date)

# Use tomorrow's 'Close' price as the target variable

target_variable = stock_data['Close'].shift(-1).dropna()

 

# Predict stock prices using Random Forest Regressor

result_1 = predict_asset_prices(stock_data.iloc[:-1,:], target_variable)

After executing the code, the function prints:

Mean Squared Error: 5.512690672097417

Now let’s use the PCA feature extraction technique to improve the predictive model.

Again, we first create a function to reduce the dimension of the feature data and then feed its output into our RandomForestRegressor model.

One important step here is to standardize the data before applying PCA. This greatly improves the accuracy and speed of the code, especially when the dataset contains columns such as price and volume that are not in the same range. To standardize the data, we need its mean and standard deviation. We also save the mean and standard deviation of the Close price into a variable, because we will need them in the forecasting step to de-standardize the predicted result and recover the actual price. Using the PCA class from sklearn.decomposition, we fit the standardized data and return the PCA-transformed features together with the saved mean and standard deviation.

n_components is the number of features we want to extract. For example, if we have six data columns (Open, High, Low, Close, Adj Close, Volume) and want to end up with 2 extracted features, n_components would be 2.

# Function to preprocess data and apply PCA

def apply_pca(data, n_components=1):  # default n_components to 1

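    # Use all columns of the price data (six in this example) as features

    features = data.values.reshape(-1, data.shape[1])

    # Mean and std of the Close price, kept to de-standardize forecasts later

    factors = [data['Close'].mean(), data['Close'].std()]

    # Standardize data using the overall mean and std of the feature matrix

    features_standardized = (features - features.mean()) / features.std()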

    # Apply PCA

    pca = PCA(n_components=n_components)

    principal_components = pca.fit_transform(features_standardized) 

    # Create DataFrame with principal components

    pc_df = pd.DataFrame(data=principal_components, columns=[f'PC{i+1}' for i in range(n_components)])   

    return (pc_df,factors)
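A note on the standardization step: on a NumPy array, features.mean() and features.std() are computed over all values at once, so a large-valued column such as volume still dominates after scaling. A common alternative, shown here only as a sketch with made-up numbers, is to standardize each column separately with axis=0 (scikit-learn's StandardScaler does the same thing). Switching to it would change the results reported in this post.

```python
import numpy as np

# Hypothetical rows of (close price, volume); the values are made up.
features = np.array([
    [150.0, 90_000_000.0],
    [152.0, 70_000_000.0],
    [149.0, 110_000_000.0],
    [155.0, 80_000_000.0],
])

# Overall standardization: one mean and one standard deviation for all values.
overall = (features - features.mean()) / features.std()

# Column-wise standardization: each column gets its own mean and std.
column_wise = (features - features.mean(axis=0)) / features.std(axis=0)

print(overall.std(axis=0))      # the two columns still have very different spreads
print(column_wise.std(axis=0))  # every column now has unit spread
```

Since PCA looks for directions of maximum variance, the column-wise version lets price and volume contribute on an equal footing.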

 

After creating the apply_pca function, it’s time to see how it can improve our Random Forest model.

We have already downloaded the data, so let’s use the apply_pca function. We set n_components=3, as we want to extract 3 features for our machine learning model.

After calling apply_pca, we just need to feed the PCA result into the predict_asset_prices function. Once again, make sure to remove the last row of the feature data, as it has no target value after the one-day shift.

PCA_Result = apply_pca(stock_data, n_components=3)

# PCA data frame

principal_components_df=PCA_Result[0]

 

# Target variable: the 'Close' price shifted by one day (target_var)

target_var = stock_data['Close'].shift(-1).dropna()

 

# Standardize the target as well

target_standardized = (target_var - target_var.mean()) / target_var.std()

 

# Predict stock prices using Random Forest Regressor; we just need to feed the

# PCA features into the predict_asset_prices function.

result = predict_asset_prices(principal_components_df.iloc[:-1,:], target_standardized)

 

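The MSE is now far smaller than the 5.5127 obtained without PCA. Keep in mind, however, that this second value is measured on the standardized target while the first is in squared price units, so the two should be brought to the same scale before concluding how much the model has improved.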

Mean Squared Error: 0.0038909149199161473
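Since this MSE is computed on a standardized target, one way to express it in squared price units is to multiply it by the variance of the target: if z = (y - mean) / std, every error shrinks by a factor of std, so the squared error shrinks by std squared. The helper name and the numbers below are hypothetical and are not results from this post.

```python
def mse_in_price_units(mse_standardized, target_std):
    """Convert an MSE on a standardized target back to squared price units."""
    return mse_standardized * target_std ** 2


# Hypothetical values, chosen only to keep the arithmetic simple.
print(mse_in_price_units(0.25, 4.0))  # 4.0
```

With the names used in this post, target_std would be target_var.std().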

Now that we can enhance the model using feature engineering techniques, let’s predict tomorrow’s price of Apple.

To do that, we again use the predict method of the fitted model and forecast tomorrow’s price from the last day’s data. To get the actual price, we need to de-standardize the prediction using the saved mean and standard deviation.

# Use today's PCA features to forecast tomorrow's close

PCA_predict = result.predict([principal_components_df.iloc[-1,:]])

# De-standardize the result to get the actual forecasted price

print(PCA_predict*PCA_Result[1][1] + PCA_Result[1][0])

The forecasted next-day Apple price is:

[192.34270398]
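The de-standardization used to obtain this price is simply the inverse of the standardization formula. A minimal round-trip sketch, with hypothetical helper names and made-up values for the mean and standard deviation:

```python
def standardize(value, mean, std):
    """Map a raw value to standardized units."""
    return (value - mean) / std


def destandardize(z, mean, std):
    """Map a standardized value back to the original units."""
    return z * std + mean


# Hypothetical statistics, not the ones computed from the Apple data.
mean, std = 120.0, 40.0
price = 180.0

z = standardize(price, mean, std)
print(z)                            # 1.5
print(destandardize(z, mean, std))  # 180.0
```

The round trip only recovers the original value when the same mean and standard deviation are used in both directions.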

The next step would be to create a trading strategy based on this forecast and back-test it before using it in actual trading.

