Asset Price Prediction Using Principal Component Analysis and a Machine Learning Regression Model
In this post, we are trying to predict tomorrow’s price of a financial asset using a machine learning method and show how we can improve the prediction result by using a feature extraction technique such as principal component analysis (PCA).
What is feature extraction?
Feature extraction is the process of deriving a smaller set of relevant and
informative features from a dataset to improve the performance of machine
learning models used for financial analysis. It reduces the number of features
in the model by creating new features from the existing attributes. A related
process, feature selection, keeps the subset of original features that
contribute the most to the prediction outcomes and removes noisy and
irrelevant ones.
In finance, several types of feature
extraction techniques are used to identify the most relevant features from a
dataset. Some of the most commonly used feature extraction techniques include:
- Principal Component Analysis (PCA): A statistical technique that finds the directions of greatest variance in a dataset and creates a new set of uncorrelated variables that are linear combinations of the original variables.
- Autoencoder: A neural network-based technique that learns a compressed representation of the input data by encoding it into a lower-dimensional space.
- Wavelet Transform: A mathematical technique that decomposes a signal into different frequency bands and extracts features from each band.
- Mean and Standard Deviation Computations: A statistical technique that calculates the mean and standard deviation of a dataset to extract features.
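To make the PCA description concrete, here is a minimal sketch on synthetic data (the data and variable names are assumptions for illustration, not part of the asset example in this post). It shows that each principal component is a linear combination of the mean-centered original columns.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data (an assumption for illustration): three columns driven
# mostly by one hidden factor, plus a little independent noise.
rng = np.random.default_rng(0)
factor = rng.normal(size=500)
X = np.column_stack([
    factor + 0.1 * rng.normal(size=500),
    2.0 * factor + 0.1 * rng.normal(size=500),
    -factor + 0.1 * rng.normal(size=500),
])

pca = PCA(n_components=2)
scores = pca.fit_transform(X)

# Each row of components_ holds the weights of one linear combination
# of the original (mean-centered) columns.
manual_scores = (X - X.mean(axis=0)) @ pca.components_.T

print(pca.explained_variance_ratio_)       # share of variance per component
print(np.allclose(scores, manual_scores))  # the two computations agree
```

Because the three columns share one underlying factor, the first component should capture almost all of the variance here.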
In asset price forecasting, feature
extraction techniques such as principal component analysis (PCA) and
autoencoder have been successfully applied to identify critical features that
affect the performance of machine learning models and achieve more accurate
stock price predictions.
Python Implementation
In the following Python code, we implement a machine learning regression
model called Random Forest to predict the next-day asset price.
First, we need to load all necessary
packages into the Python environment.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
import numpy as np
import yfinance as yf
from sklearn.decomposition import PCA
Then, we need to get the asset prices and load them into our Python
environment. The following function uses the yfinance package to download
asset prices from Yahoo Finance.
# Function to download historical stock data using yfinance
def Get_asset_data(ticker, start_date, end_date):
    stock_data = yf.download(ticker, start=start_date, end=end_date)
    return stock_data
The predict_asset_prices function performs the following steps:
1. It splits the asset data into a 70% training set and a 30% test set.
2. It fits a Random Forest regression model to the training data and saves
the fitted model in the my_model variable.
3. It tests the performance of the model against unseen data. In this step,
it calls the predict method of the fitted sklearn model with the X_test
feature values and saves the output in the predictions variable.
4. It calculates the mean squared error (MSE) to measure the average squared
difference between the actual target values and the predicted values. The
smaller the calculated MSE, the better the fitted model (although we need to
make sure the model is not overfitted).
5. Finally, it prints the MSE result and returns the fitted model, which will
be used for forecasting the next-day asset price.
It’s
important to note that we need to create feature and target input data to use
in this function. I will explain how to do it later in this post.
# ML function to fit a Random Forest model and report its test MSE
def predict_asset_prices(features, target):
    # Split into 70% training data and 30% test data
    X_train, X_test, y_train, y_test = train_test_split(
        features, target, test_size=0.3, random_state=42)
    # Use RandomForestRegressor model
    model = RandomForestRegressor(n_estimators=700, random_state=42)
    my_model = model.fit(X_train.values, y_train.values)
    # Predictions
    predictions = model.predict(X_test.values)
    # Evaluate the model using Mean Squared Error (MSE)
    mse = mean_squared_error(np.array(y_test), predictions)
    print(f'Mean Squared Error: {mse}')
    return my_model
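One thing to be aware of: train_test_split shuffles the rows by default, so with time-ordered prices the test set can contain days that come before some of the training days. A possible alternative, sketched below on toy data (the data frame is made up for illustration), is a chronological split using shuffle=False. This is only a sketch of the idea, not the split used for the results in this post.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy, date-indexed data standing in for real prices (illustrative only).
dates = pd.date_range("2023-01-02", periods=10, freq="B")
features = pd.DataFrame({"Close": np.arange(10.0)}, index=dates)
target = features["Close"].shift(-1).dropna()
features = features.iloc[:-1, :]

# shuffle=False keeps the rows in time order: the earliest rows are used
# for training and the most recent 30% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.3, shuffle=False
)

# Every training date precedes every test date.
print(X_train.index.max() < X_test.index.min())
```

With a chronological split the test error is measured only on days the model could not have seen, which is closer to how the model would be used for forecasting.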
At this point, let's get the data from Yahoo Finance, fit it to the
RandomForestRegressor model, and see how well it predicts the asset price.
For this example, we use the Apple stock
price ('AAPL') between '2017-01-01' and '2023-12-22'.
As mentioned earlier, we need to create a target variable series to pass
into the predict_asset_prices function. Because we want to predict tomorrow's
asset price, we shift the daily close price by one day so that each row holds
the next day's close, and we use this shifted time series as the target input
of the function.
It is important to note that this shift of the price data creates an NA
target value (Y dataset) for the last day, so we need to remove the last day
from the features data (X dataset) as well. We have done this using
stock_data.iloc[:-1,:].
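The effect of the shift can be illustrated on a toy series (the prices below are made up for illustration):

```python
import pandas as pd

# Toy closing prices (made-up values for illustration).
close = pd.Series(
    [10.0, 11.0, 12.0, 13.0],
    index=pd.date_range("2023-01-02", periods=4, freq="B"),
    name="Close",
)

# shift(-1) moves each value up one row, so every date is paired with the
# next day's close; the last date has no "tomorrow" and becomes NaN.
target = close.shift(-1)
print(target)

# Drop the NaN target and the matching last feature row so both align.
target = target.dropna()
features = close.iloc[:-1]
print(len(features) == len(target))
```

After these two steps, row i of the features and row i of the target refer to the same date, with the target holding the following day's close.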
# APPLE example usage
ticker_symbol = 'AAPL'
start_date = '2017-01-01'
end_date = '2023-12-22'
stock_data = Get_asset_data(ticker_symbol, start_date, end_date)
# Use tomorrow's 'Close' prices as the target variable
target_variable = stock_data['Close'].shift(-1).dropna()
# Predict stock prices using Random Forest Regressor
result_1 = predict_asset_prices(stock_data.iloc[:-1, :], target_variable)
After executing the code, the function prints:
Mean Squared Error: 5.512690672097417
Now let's use the PCA feature extraction technique to improve the
predictive model.
Again, we first create a function that reduces the dimension of the feature
data, and then we fit the result to our RandomForestRegressor model.
One important step here is to standardize the data before applying PCA.
Because PCA is sensitive to the scale of its inputs, this matters especially
when the dataset mixes values such as prices and volumes that are not in the
same range. To standardize the data, we first calculate the mean and standard
deviation and save them in variables. We will need this information in the
forecasting step to de-standardize the predicted result and recover the actual
price. Using the PCA class from sklearn.decomposition, we fit the standardized
data to the PCA model and return the PCA-engineered features together with
the mean and standard deviation.
n_components is the number of engineered features (principal components) we
want to keep. For example, if we have six data columns (Open, High, Low,
Close, Adj Close, Volume) and want to use two engineered features in the final
model, n_components would be 2.
# Function to preprocess data and apply PCA
def apply_pca(data, n_components=1):  # default n_components to 1
    # Feature engineering: use all six data columns as features
    features = data.values.reshape(-1, 6)
    # Mean and standard deviation of Close, needed later to de-standardize
    factors = [data['Close'].mean(), data['Close'].std()]
    # Standardize data (mean() and std() without an axis use all values at once)
    features_standardized = (features - features.mean()) / features.std()
    # Apply PCA
    pca = PCA(n_components=n_components)
    principal_components = pca.fit_transform(features_standardized)
    # Create DataFrame with principal components
    pc_df = pd.DataFrame(data=principal_components,
                         columns=[f'PC{i+1}' for i in range(n_components)])
    return (pc_df, factors)
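Note that features.mean() and features.std() on a NumPy array, called without an axis argument, each return a single number computed over all values, so the columns keep their very different scales. A per-column variant is sketched below on synthetic data. Using StandardScaler here is my assumption of one reasonable way to do it, not the method behind the results reported in this post.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for price and volume columns on very different scales.
rng = np.random.default_rng(1)
price = 100 + rng.normal(scale=5.0, size=300)
volume = 1e7 + rng.normal(scale=1e6, size=300)
features = np.column_stack([price, price + rng.normal(size=300), volume])

# Global standardization: one scalar mean and one scalar std for everything.
global_std = (features - features.mean()) / features.std()

# Per-column standardization: each column gets zero mean and unit variance.
per_column = StandardScaler().fit_transform(features)

print(global_std.std(axis=0))  # columns remain on very different scales
print(per_column.std(axis=0))  # every column now has standard deviation 1

pca = PCA(n_components=2)
components = pca.fit_transform(per_column)
print(pca.explained_variance_ratio_)
```

With per-column scaling, no single column (such as volume) dominates the principal components simply because its raw numbers are larger.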
After creating the apply_pca function, it's time to see how it can improve
our Random Forest model.
We have already downloaded the data, so let's use the apply_pca function. We
set n_components=3 because we want to extract 3 features for our machine
learning model.
After calling apply_pca, we just need to feed the standardized PCA result
into the predict_asset_prices function. Once again, make sure to remove the
last row of the features, because no target value is available for it after
the one-day shift.
PCA_Result = apply_pca(stock_data, n_components=3)
# PCA data frame
principal_components_df = PCA_Result[0]
# Already shifted target variable (target_var)
target_var = stock_data['Close'].shift(-1).dropna()
# Now standardize the target
target_standardized = (target_var - target_var.mean()) / target_var.std()
# Predict stock prices using Random Forest Regressor: just feed the
# PCA features into the predict_asset_prices function.
result = predict_asset_prices(principal_components_df.iloc[:-1, :],
                              target_standardized)
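The output shows a much lower MSE than the 5.5127 obtained without PCA. Note,
however, that the target is now standardized, so this MSE is in standardized
units and must be multiplied by the target variance (target_var.std() ** 2)
before it is directly comparable with the earlier value.
Mean Squared Error: 0.0038909149199161473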
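Because the second model is trained on a standardized target, its MSE and the earlier MSE are measured on different scales. The sketch below uses made-up numbers (not the Apple data) to show how a standardized MSE can be converted back to squared price units by multiplying it by the variance used for standardization.

```python
import numpy as np

# Made-up target prices and predictions, for illustration only.
rng = np.random.default_rng(2)
actual_price = 150 + rng.normal(scale=30.0, size=200)
predicted_price = actual_price + rng.normal(scale=2.0, size=200)

mean, std = actual_price.mean(), actual_price.std()

# Standardize both series with the same mean and standard deviation.
actual_z = (actual_price - mean) / std
predicted_z = (predicted_price - mean) / std

mse_price = np.mean((actual_price - predicted_price) ** 2)
mse_standardized = np.mean((actual_z - predicted_z) ** 2)

# Multiplying by the variance puts the standardized MSE back in price units.
print(mse_price, mse_standardized * std ** 2)
```

The two printed values agree, so applying the same conversion to the PCA model's MSE would put it on the same footing as the 5.5127 figure.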
Now that we have seen how to enhance the model using feature engineering
techniques, let's predict tomorrow's asset price for Apple.
To do that, we again use the predict method of the fitted sklearn model and
forecast tomorrow's price based on the last day's data. To get the actual
price, we need to de-standardize the prediction using the calculated mean and
standard deviation.
# Use today's featured data to forecast tomorrow's close
PCA_predict = result.predict(principal_components_df.iloc[[-1], :].values)
# De-standardize the result to get the actual forecasted price
print(PCA_predict * PCA_Result[1][1] + PCA_Result[1][0])
The next-day Apple price is calculated as follows:
[192.34270398]
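The de-standardization step is simply the inverse of the standardization formula: price = z * std + mean. A small round-trip sketch, using hypothetical helper names (standardize and destandardize are not part of the code above) and made-up prices, could look like this:

```python
import pandas as pd

def standardize(series):
    """Return the standardized series together with the mean and std used."""
    mean, std = series.mean(), series.std()
    return (series - mean) / std, mean, std

def destandardize(values, mean, std):
    """Invert standardize(): map standardized values back to price units."""
    return values * std + mean

prices = pd.Series([180.0, 185.0, 190.0, 195.0])  # made-up prices
z, mean, std = standardize(prices)
restored = destandardize(z, mean, std)
print(restored.tolist())
```

The round trip only recovers the original prices when the same mean and standard deviation are used in both directions, which is why they have to be saved at standardization time.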
The next step would be to create a trading strategy based on this forecast
and back-test it before using it in actual trading.