4. ARMA modeling
The AR and MA models are quite simple. To put it simply, Markdown is too tired to write formulas. If you are interested, please check them yourself. AR (Autoregression) model is mainly used to model time series. If the series has passed the ACF test, that is, the autocorrelation coefficient with an interval of 1 is significant, that is, the data at time may be useful for predicting time t.
MA (Moving Average) model uses the random interference or error prediction of the past q periods to express the current prediction value linearly.
In order to fully describe the dynamic structure of the data, it is necessary to increase the order of AR or MA models, but such parameters will make the calculation more complex. Therefore, in order to simplify this process, an autoregressive moving average (ARMA) model is proposed.
Since price time series are generally non-stationary, and the optimization effect of difference method on stationarity has been discussed previously, ARIMA (p, d, q) (sum autoregressive moving average) model adds d-order difference processing on the basis of the application of existing models to series. However, since we have used logarithms, we can use ARMA (p, q) directly.
In a word, the only difference between ARIMA model and ARMA model building process is that if the unstable results are obtained after analyzing the stationarity, the model will make a quadratic difference to the series directly and then perform the stationarity test, and then determine the order p and q until the series is stable. After building the model and evaluating it, the subsequent prediction will be made, eliminating the step of going back to do the difference. However, the second-order difference of price is meaningless, so ARMA is the best choice.
4-1. Order selection
Next, we can select the order directly by the information criterion, here we try it with the thermodynamic diagrams of AIC and BIC.
In [10]:
def select_best_params():
ps = range(0, 4)
ds= range(1, 2)
qs = range(0, 4)
parameters = product(ps, ds, qs)
parameters_list = list(parameters)
p_min = 0
d_min = 0
q_min = 0
p_max = 3
d_max = 3
q_max = 3
results_aic = pd.DataFrame(index=['AR{}'.format(i) for i in range(p_min,p_max+1)],
columns=['MA{}'.format(i) for i in range(q_min,q_max+1)])
results_bic = pd.DataFrame(index=['AR{}'.format(i) for i in range(p_min,p_max+1)],
columns=['MA{}'.format(i) for i in range(q_min,q_max+1)])
best_params = []
aic_results = []
bic_results = []
hqic_results = []
best_aic = float("inf")
best_bic = float("inf")
best_hqic = float("inf")
warnings.filterwarnings('ignore')
for param in parameters_list:
try:
model = sm.tsa.SARIMAX(kline_all['log_price'], order=(param[0], param[1], param[2])).fit(disp=-1)
results_aic.loc['AR{}'.format(param[0]), 'MA{}'.format(param[2])] = model.aic
results_bic.loc['AR{}'.format(param[0]), 'MA{}'.format(param[2])] = model.bic
except ValueError:
continue
aic_results.append([param, model.aic])
bic_results.append([param, model.bic])
hqic_results.append([param, model.hqic])
results_aic = results_aic[results_aic.columns].astype(float)
results_bic = results_bic[results_bic.columns].astype(float)
# Draw thermodynamic diagrams of AIC and BIC to find the best
fig = plt.figure(figsize=(18, 9))
layout = (2, 2)
aic_ax = plt.subplot2grid(layout, (0, 0))
bic_ax = plt.subplot2grid(layout, (0, 1))
aic_ax = sns.heatmap(results_aic,mask=results_aic.isnull(),ax=aic_ax,cmap='OrRd',annot=True,fmt='.2f',);
aic_ax.set_title('AIC');
bic_ax = sns.heatmap(results_bic,mask=results_bic.isnull(),ax=bic_ax,cmap='OrRd',annot=True,fmt='.2f',);
bic_ax.set_title('BIC');
aic_df = pd.DataFrame(aic_results)
aic_df.columns = ['params', 'aic']
best_params.append(aic_df.params[aic_df.aic.idxmin()])
print('AIC best param: {}'.format(aic_df.params[aic_df.aic.idxmin()]))
bic_df = pd.DataFrame(bic_results)
bic_df.columns = ['params', 'bic']
best_params.append(bic_df.params[bic_df.bic.idxmin()])
print('BIC best param: {}'.format(bic_df.params[bic_df.bic.idxmin()]))
hqic_df = pd.DataFrame(hqic_results)
hqic_df.columns = ['params', 'hqic']
best_params.append(hqic_df.params[hqic_df.hqic.idxmin()])
print('HQIC best param: {}'.format(hqic_df.params[hqic_df.hqic.idxmin()]))
for best_param in best_params:
if best_params.count(best_param)>=2:
print('Best Param Selected: {}'.format(best_param))
return best_param
best_param = select_best_params()
Out[10]:
AIC best param: (0, 1, 1)
BIC best param: (0, 1, 1)
HQIC best param: (0, 1, 1)
Best Param Selected: (0, 1, 1)
It is obvious that the optimal first-order parameter combination for logarithmic price is (0,1,1), which is simple and straightforward. The log_return (logarithmic rate of return) performs the same operation. The AIC optimal value is (4,3), and the BIC optimal value is (0,1). So the optimal combination of parameters for log_return (logarithmic rate of return) is (0,1).
4-2. ARMA modeling and matching
Quarterly coefficients are not required, but SARIMAX is richer in attributes, so it was decided to choose this model for modeling and incidentally draw a descriptive analysis as follows:
In [11]:
params = (0, 0, 1)
training_model = smt.SARIMAX(endog=kline_all['log_return'], trend='c', order=params, seasonal_order=(0, 0, 0, 0))
model_results = training_model.fit(disp=False)
model_results.summary()
Out[11]:
Statespace Model Results
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
In [27]:
model_results.plot_diagnostics(figsize=(18, 10));
Out[27]:
The probability density KDE in the histogram is far from the normal distribution N (0,1), indicating that the residual is not a normal distribution. In the QQ quantile plot, the residuals of the samples sampled from the standard normal distribution do not fully follow the linear trend, so it is confirmed again that the residuals are not normal distribution and are closer to white noise.
Then, having said that, whether the model can be used still needs to be tested.
4-3. Model test
The matching effect of the residual is not ideal, so we performed the Durbin Watson test on it. The original hypothesis of the test is that the sequence does not have autocorrelation, and the alternative hypothesis sequence is stationary. In addition, if the P values of LB, JB and H tests are less than the critical value of 0.05% confidence level, the original hypothesis will be rejected.
In [12]:
het_method='breakvar'
norm_method='jarquebera'
sercor_method='ljungbox'
(het_stat, het_p) = model_results.test_heteroskedasticity(het_method)[0]
norm_stat, norm_p, skew, kurtosis = model_results.test_normality(norm_method)[0]
sercor_stat, sercor_p = model_results.test_serial_correlation(method=sercor_method)[0]
sercor_stat = sercor_stat[-1] # The last value of the maximum period
sercor_p = sercor_p[-1]
dw = sm.stats.stattools.durbin_watson(model_results.filter_results.standardized_forecasts_error[0, model_results.loglikelihood_burn:])
arroots_outside_unit_circle = np.all(np.abs(model_results.arroots) > 1)
maroots_outside_unit_circle = np.all(np.abs(model_results.maroots) > 1)
print('Test heteroskedasticity of residuals ({}): stat={:.3f}, p={:.3f}'.format(het_method, het_stat, het_p));
print('\nTest normality of residuals ({}): stat={:.3f}, p={:.3f}'.format(norm_method, norm_stat, norm_p));
print('\nTest serial correlation of residuals ({}): stat={:.3f}, p={:.3f}'.format(sercor_method, sercor_stat, sercor_p));
print('\nDurbin-Watson test on residuals: d={:.2f}\n\t(NB: 2 means no serial correlation, 0=pos, 4=neg)'.format(dw))
print('\nTest for all AR roots outside unit circle (>1): {}'.format(arroots_outside_unit_circle))
print('\nTest for all MA roots outside unit circle (>1): {}'.format(maroots_outside_unit_circle))
root_test=pd.DataFrame(index=['Durbin-Watson test on residuals','Test for all AR roots outside unit circle (>1)','Test for all MA roots outside unit circle (>1)'],columns=['c'])
root_test['c']['Durbin-Watson test on residuals']=dw
root_test['c']['Test for all AR roots outside unit circle (>1)']=arroots_outside_unit_circle
root_test['c']['Test for all MA roots outside unit circle (>1)']=maroots_outside_unit_circle
root_test
Out[12]:
Test heteroskedasticity of residuals (breakvar): stat=24.598, p=0.000
Test normality of residuals (jarquebera): stat=106398.739, p=0.000
Test serial correlation of residuals (ljungbox): stat=104.767, p=0.000
Durbin-Watson test on residuals: d=2.00
(NB: 2 means no serial correlation, 0=pos, 4=neg)
Test for all AR roots outside unit circle (>1): True
Test for all MA roots outside unit circle (>1): True
In [13]:
kline_all['log_price_dif1'] = kline_all['log_price'].diff(1)
kline_all = kline_all[1:]
kline_train = kline_all
training_label = 'log_return'
training_ts = pd.DataFrame(kline_train[training_label], dtype=np.float)
delta = model_results.fittedvalues - training_ts[training_label]
adjR = 1 - delta.var()/training_ts[training_label].var()
adjR_test=pd.DataFrame(index=['adjR2'],columns=['Value'])
adjR_test['Value']['adjR2']=adjR**2
adjR_test
Out[13]:
If the Durbin Watson test statistic is equal to 2, confirms that the series has no correlation, and its statistical value is distributed between (0,4). Close to 0 means that the positive correlation is high, while close to 4 means that the negative correlation is high. Here it is approximately equal to 2. The P value of other tests is small enough, the unit characteristic root is outside the unit circle, and the larger the value of the modified adjR², the better it will be. The overall result of the measure does not seem to be satisfactory.
In [14]:
model_results.params
Out[14]:
intercept -0.000817
ma.L1 -0.102102
sigma2 0.000275
dtype: float64
To sum up, this order setting parameter can basically meet the requirements of modeling time series and subsequent volatility modeling, but the matching effect is so-so. The model expression is as follows:
To be continued...