Predicting Stock Market in Tensorflow, Awesome Result!
Hi Everyone!
Recently I have been into the stock market, like A LOT (the smell of newbie is too strong). I have been reading some books about finance, stock investing, and trading. Since I believe learning all of it will take some time, why don’t I let a machine do it for me for a while?
So I searched for stock prediction with machine learning, and there were thousands of posts on the internet (of course!). I chose one of the best, this post by Aishwarya Singh from 2018. You should check out that post too!
After a classy copy-paste and a bit of tweaking, here is my version:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
import datetime as dt
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, LSTM
from tensorflow.keras.callbacks import ModelCheckpoint
import math
import os
import psycopg2
from config import config_prod
The libraries you need (or not).
#Extracting Data
code = 'TLKM.JK'
sql = '''select
t.transaction_date,
t.transaction_open,
t.transaction_high,
t.transaction_low,
t.transaction_close,
t.transaction_adj_close,
t.transaction_volume
from trading.countries as c
inner join trading.stocks as s on c.id = s.country_id
inner join trading.daily_transaction t on s.id = t.stock_id
where c.id = 1 and t.is_deleted = false and s.code like %s '''
conn = None
result = []
try:
    params = config_prod()
    conn = psycopg2.connect(**params)
    cur = conn.cursor()
    cur.execute(sql, (code,))
    conn.commit()
    result = cur.fetchall()
    cur.close()
except (Exception, psycopg2.DatabaseError) as error:
    print(error)
finally:
    if conn is not None:
        conn.close()

df = pd.DataFrame(result, columns = ['date', 'open', 'high', 'low', 'close', 'adj_close', 'volume'])
df.index = df.date
df = df.sort_index(ascending = True, axis = 0)
df['date'] = pd.to_datetime(df.date, format = '%Y-%m-%d')
So I started by querying data from my database. I have set up a scheduler that inserts daily data for most Indonesian stocks into my database (I will write a post about it later!). I also reformatted the date and set the transaction date as the index, since the model will be a time-based regression.
The data looks like the picture above.
new_data = pd.DataFrame(index=range(0, len(df)), columns = ['date', 'close'])
for i in range(0, len(df)):
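    # df is indexed by date, so read by position with .iloc;
    # writing through .loc avoids pandas chained assignment
    new_data.loc[i, 'date'] = df['date'].iloc[i]
    new_data.loc[i, 'close'] = df['close'].iloc[i]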
With the data frame ready, I prepared a new variable to play with: new_data. Since the model learns from the closing value of the stock, the only columns I extract from the data frame are the transaction date and the transaction close value.
new_data contains these values:
new_data.index = new_data.date
new_data.drop('date', axis=1, inplace = True)
I did the same to the new dataset: I set the date as the index and then dropped it as a column.
So the new dataset looks like the picture above.
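The same selection can also be done without a loop. This is only a sketch of an alternative, not what the original post does, and the tiny data frame below is made up so the snippet runs without a database:

```python
import pandas as pd

# Made-up rows standing in for the query result
df = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-02", "2020-01-03", "2020-01-06"]),
    "open": [3900.0, 3920.0, 3950.0],
    "close": [3910.0, 3940.0, 3930.0],
})

# Keep only date and close, then move date into the index in one step
new_data = df[["date", "close"]].set_index("date")
print(new_data)
```

The result has a single close column indexed by date, which is the same layout the loop plus the two lines above produce.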
dataset = new_data.values.astype(float)
The original post I followed used a text file as its data source. The problem arises when I use a PostgreSQL query as the source instead. Without astype(float), the array looks like this:
Later on I divide the dataset into a training set and a validation set. The training set goes through several steps (such as scaling) that strip the Decimal() wrapper from the array values, but the validation set does not. The result is an error during model validation at the end of the process:
unsupported operand type(s) for -: 'decimal.Decimal' and 'float'
After some searching on the internet, I found that astype(float) fixes this issue.
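Here is a minimal sketch that reproduces the problem without a database. The prices are made up; the assumption is that the close column is a PostgreSQL NUMERIC type, which psycopg2 hands back as Python Decimal objects:

```python
from decimal import Decimal
import numpy as np

# Made-up closes, shaped like what a NUMERIC column returns (assumption)
closes = [Decimal("3510.00"), Decimal("3525.50"), Decimal("3498.25")]

raw = np.array(closes).reshape(-1, 1)   # dtype=object, elements are still Decimal
as_float = raw.astype(float)            # dtype=float64

try:
    raw[0, 0] - 0.5                     # Decimal minus float is not defined
    mixed_ok = True
except TypeError:
    mixed_ok = False

diff = as_float[0, 0] - 0.5             # plain float arithmetic works
print(raw.dtype, as_float.dtype, mixed_ok, diff)
```

Converting once, right after building the array, keeps the training and validation sets on the same dtype.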
threshold = 0.8
train = dataset[:math.ceil(len(df) * threshold), :]
valid = dataset[math.ceil(len(df) * threshold):, :]
I split the dataset 80:20 into training and validation sets.
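To make the slicing concrete, here is the same split on a toy array of ten values (the numbers are placeholders, not stock data). Note that the split is chronological: the first 80% of the rows are used for training and the last 20% for validation, with no shuffling.

```python
import math
import numpy as np

dataset = np.arange(10, dtype=float).reshape(-1, 1)  # stand-in for the close prices
threshold = 0.8
split = math.ceil(len(dataset) * threshold)          # index of the first validation row

train = dataset[:split, :]   # rows 0 .. split-1
valid = dataset[split:, :]   # rows split .. end
print(train.shape, valid.shape)
```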
scaler = MinMaxScaler(feature_range=(0, 1))
scaled_data = scaler.fit_transform(dataset)
I also scale the data to the range 0 to 1. This became a concern of mine when thinking about deployment: what happens if an upcoming close is higher than the current maximum for that stock, so that it would scale to more than 1? MinMaxScaler does not refit itself: transform applies the same linear mapping learned during fit, so such a value lands outside the 0 to 1 range the model was trained on (or is capped at the boundary if clip=True is set). In other words, both the MinMaxScaler and the Sequential model need to be rerun every time there is new input to give effective results.
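A quick way to check how MinMaxScaler treats values outside the range it was fitted on is a sketch like the one below. The prices are made up, and the scaler uses its default settings (no clipping):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

history = np.array([[100.0], [150.0], [200.0]])   # made-up closes: min 100, max 200
scaler = MinMaxScaler(feature_range=(0, 1))
scaler.fit(history)

# transform reuses the mapping learned in fit; it does not adapt to new extremes
new_high = scaler.transform(np.array([[250.0]]))  # above the fitted maximum
new_low = scaler.transform(np.array([[50.0]]))    # below the fitted minimum
print(new_high, new_low)
```

With these numbers the new high scales to roughly 1.5 and the new low to roughly -0.5, both outside the range seen in training, which is why refitting on fresh data matters.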
pred = 60
x_train, y_train = [], []
for i in range(pred,len(train)):
    x_train.append(scaled_data[i-pred:i,0])
    y_train.append(scaled_data[i,0])
x_train, y_train = np.array(x_train), np.array(y_train)
Here is another difficult part. First, the 60 in the variable pred is customizable. To predict day n, the model needs the values from day n-60 to day n-1. If I want to know tomorrow’s price, I need to feed the model the closing values of the stock from the last 60 days.
The variable x_train holds the closing values of the previous 60 days, and y_train holds the 61st value, the one the model learns to predict.
Notice that the loop starts at pred, the number of data points used for each prediction. The reason is that the model needs at least 60 data points (in this case) as a baseline.
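The windowing is easier to see on a tiny series. The sketch below uses a window of 3 instead of 60 and the numbers 0 to 9 instead of scaled prices; both are placeholders for illustration:

```python
import numpy as np

series = np.arange(10, dtype=float)   # stand-in for the scaled closing prices
pred = 3                              # window length (60 in the post)

x, y = [], []
for i in range(pred, len(series)):
    x.append(series[i - pred:i])      # the pred values just before position i
    y.append(series[i])               # the value at position i, the target
x, y = np.array(x), np.array(y)
print(x.shape, y.shape)
print(x[0], y[0])
```

Ten values with a window of 3 give 7 samples: the first window is positions 0 to 2 with position 3 as its target, and the last is positions 6 to 8 with position 9 as its target.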
x_train = np.reshape(x_train, (x_train.shape[0],x_train.shape[1],1))
The line of code above reshapes the training data into the 3D shape the LSTM layer expects: (samples, time steps, features).
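As a small illustration of what the reshape changes, here is the same call on a dummy array of 7 windows with 60 values each (the sizes are arbitrary):

```python
import numpy as np

windows = np.zeros((7, 60))   # 7 samples, 60 time steps each
lstm_input = np.reshape(windows, (windows.shape[0], windows.shape[1], 1))
print(windows.shape, "->", lstm_input.shape)
```

Only the trailing axis is added; the values themselves are untouched. The 1 means each time step carries a single feature, the closing price.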
model = Sequential()
model.add(LSTM(units=50, return_sequences=True, input_shape=(x_train.shape[1],1)))
model.add(LSTM(units=50))
model.add(Dense(1))

model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(x_train, y_train, epochs=1, batch_size=1, verbose=2)
As for the model settings, I did not change a thing from the original post. I am also new to TensorFlow, so let’s just leave it as it is for now.
inputs = new_data[len(new_data) - len(valid) - pred:].values
inputs = inputs.reshape(-1,1)
inputs = scaler.transform(inputs)

X_test = []
for i in range(pred,inputs.shape[0]):
    X_test.append(inputs[i-pred:i,0])
X_test = np.array(X_test)
So now the test set is being prepared. The concept is the same as for the training set: each sample has 60 known values and 1 prediction. The 20% validation portion of the dataset is turned into X_test using the rule above, resulting in 781 windows of 60 closing values each.
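The reason the slice starts pred rows before the validation set is that the first validation day also needs a full window of history. A toy version with made-up sizes (20 rows, 4 of them validation, window of 3) shows that this yields exactly one window per validation row:

```python
import numpy as np

n_total, n_valid, pred = 20, 4, 3            # toy sizes, not the post's
prices = np.arange(n_total, dtype=float).reshape(-1, 1)

# Take the validation rows plus the pred rows right before them
inputs = prices[n_total - n_valid - pred:]
X_test = np.array([inputs[i - pred:i, 0] for i in range(pred, inputs.shape[0])])
print(X_test.shape)
print(X_test[0])
```

The first window holds rows 13 to 15, which are the three rows just before the first validation row (row 16).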
X_test = np.reshape(X_test, (X_test.shape[0],X_test.shape[1],1))
Like the training set, the shape of test set is also fixed by the reshape function from numpy.
closing_price = model.predict(X_test)
closing_price = scaler.inverse_transform(closing_price)
Afterwards, the test set goes straight into the model, and the predictions are converted back to the original price scale with inverse_transform.
rms=np.sqrt(np.mean(np.power((valid-closing_price),2)))
print(rms)
The moment of truth. The model is validated using the root mean square error (RMSE), and the result is quite satisfying considering I left the model settings as they were. The RMSE is as low as 324.38.
Here are some graphs showing how well the model performs:
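For reference, here is the same RMSE formula on three made-up prices, along with one pitfall worth knowing. The error is expressed in the same units as the closing price, so it should be read against the price level of the stock.

```python
import numpy as np

actual = np.array([[100.0], [110.0], [120.0]])      # made-up true closes, shape (3, 1)
predicted = np.array([[102.0], [108.0], [123.0]])   # made-up predictions, shape (3, 1)

# Errors are -2, 2 and -3, so the mean squared error is 17/3
rmse = np.sqrt(np.mean((actual - predicted) ** 2))
print(rmse)

# Pitfall: if one array is flattened, NumPy broadcasts (3, 1) against (3,)
# into a (3, 3) matrix of all pairwise differences, and the RMSE is silently wrong
pairwise = actual - predicted.ravel()
print(pairwise.shape)
```

In the post both valid and closing_price have shape (n, 1), so the subtraction is element by element as intended.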
Thank you for reaching the end of the post. Appreciate it so much.
Please let me know if you build a better version of this stock prediction. I am looking forward to it.