As promised in our previous blog “Introduction to Time Series Forecasting in R,” we are back with our next installment to build a time series model.
Time series modeling is a low-cost solution that uses historical data to spot and analyze patterns to provide powerful insights, helping predict future outcomes.
This post walks through the steps to build a time series forecasting model from scratch. But first, let us understand the different phases of forecasting analysis.
ARIMA Forecasting Procedure
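Before building the time series model, install the R packages used in this post: forecast (for BoxCox.lambda() and BoxCox()) and a unit-root testing package such as tseries or fUnitRoots (for the augmented Dickey-Fuller test).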
Stepwise Approach to build a Time Series Model
Let us begin building a time series model, step by step.
1) Convert the Dataset into Time Series
The ts() function in R converts a numeric column of a dataset into a time series object. For example:
T_Data <- ts(Data$Count, start = c(2014,1), end = c(2017,5), frequency = 12)
Here are the details of the parameters of the ts() function:
- start: the time of the first observation.
- end: the time of the last observation.
- frequency: the number of observations per unit of time (1 = annual, 4 = quarterly, 12 = monthly, etc.).
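As a minimal sketch of how these arguments fit together, the snippet below builds a monthly series from made-up counts and inspects the result. The `counts` vector is hypothetical stand-in data, not the `Data$Count` column used above.

```r
# Hypothetical stand-in for Data$Count: 41 monthly counts (Jan 2014 to May 2017)
set.seed(1)
counts <- round(100 + 1:41 + rnorm(41, sd = 5))

# start and end are given as c(year, period); frequency = 12 marks monthly data
demo_ts <- ts(counts, start = c(2014, 1), end = c(2017, 5), frequency = 12)

print(start(demo_ts))      # first time point as c(year, month)
print(end(demo_ts))        # last time point as c(year, month)
print(frequency(demo_ts))  # observations per year
```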
2) Analyze the Time Series Object
Decompose the time series into various components.
The decompose() function in R is used to decompose the series into trend, seasonality, and random. We can visualize the decomposed parts by using the plot() function.
For example: plot(decompose(T_Data), xlab = "Year(t)")
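To see what decompose() returns without needing the post's dataset, here is a small sketch on `AirPassengers`, a monthly series that ships with base R. It is used purely as an illustration; it is not the data behind `T_Data`.

```r
# AirPassengers is a built-in monthly series (datasets package)
parts <- decompose(AirPassengers)   # additive decomposition by default

# The result is a list that holds the estimated components
print(names(parts))

# In an additive decomposition the components sum back to the original series
# (the trend is NA at both ends, so compare only where all parts are defined)
rebuilt <- parts$trend + parts$seasonal + parts$random
ok <- !is.na(rebuilt)
print(all.equal(as.numeric(rebuilt[ok]), as.numeric(AirPassengers[ok])))
```

Calling plot(parts) draws the observed series and the three components in stacked panels.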
3) Stationarize a Time Series
To build a time series model, we first have to make the series stationary. A stationary series is one whose mean and variance are constant over time. To stationarize a time series, we perform the steps below:
- Series Transformation: we can transform the series using the BoxCox() function from the forecast package, which stabilizes the variance of the series. Its parameter lambda determines the kind of transformation applied (for example, lambda = 0 corresponds to a log transform), and BoxCox.lambda() estimates a suitable value.
lambda <- BoxCox.lambda(T_Data)
plot.ts(BoxCox(T_Data, lambda = lambda))
T_Data_box <- BoxCox(T_Data, lambda = lambda)
- Differencing: differencing helps in stabilizing the mean.
T_Data_box_d <- diff(T_Data_box)
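The two steps can also be sketched in base R alone. The `box_cox()` helper below is a hand-written illustration of the transform, not the forecast package implementation, and `AirPassengers` is a stand-in series.

```r
# Hand-written Box-Cox transform for positive data (illustrative helper)
box_cox <- function(x, lambda) {
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

y <- AirPassengers                # built-in series whose spread grows over time
y_box <- box_cox(y, lambda = 0)   # lambda = 0 is the log transform
y_box_d <- diff(y_box)            # first difference, intended to stabilize the mean

# Ratio of early-years to late-years standard deviation, before and after
print(sd(window(y, end = c(1950, 12))) / sd(window(y, start = c(1959, 1))))
print(sd(window(y_box, end = c(1950, 12))) / sd(window(y_box, start = c(1959, 1))))
print(length(y_box_d))            # differencing drops one observation
```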
4) Check Stationarity of Time Series
We need to perform the augmented Dickey-Fuller (ADF) test to check the stationarity of the newly transformed series.
The adf.test() function from the tseries package performs this check (the fUnitRoots package provides a similar function, adfTest()).
Here is the output of the test.
If the p-value is less than or equal to 0.05, we reject the null hypothesis of a unit root in favor of the alternative hypothesis, i.e. the series is stationary.
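A rough sketch of the check on simulated data is shown below. It assumes the tseries package is available and skips the test otherwise; the random walk is made up for illustration.

```r
set.seed(42)
random_walk <- cumsum(rnorm(200))   # non-stationary by construction
increments  <- diff(random_walk)    # differencing recovers the white-noise steps

# tseries is an assumed, optional dependency for this sketch
if (requireNamespace("tseries", quietly = TRUE)) {
  # Null hypothesis: the series has a unit root (is non-stationary)
  print(tseries::adf.test(random_walk)$p.value)
  print(suppressWarnings(tseries::adf.test(increments))$p.value)
} else {
  message("Package 'tseries' is not installed; skipping the ADF test.")
}
```

One would expect a large p-value for the random walk and a small one for its differences.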
5) Create ACF and PACF Plots
ACF (Auto Correlation Function) and PACF (Partial Auto Correlation Function) are very useful in time series as they help us decide the appropriate time series model for forecasting.
- ACF(q): ACF is a set of correlation coefficients between a time series and lags of itself over time.
- PACF(p): PACF is a set of partial correlation coefficients between the series and lags of itself that the correlations at lower-order lags cannot explain. Let us understand this with an example: when regressing a variable Y on other variables X1, X2, and X3, the partial correlation between Y and X3 is the amount of correlation between them that is not explained by their common correlations with X1 and X2.
Following are the functions to calculate ACF and PACF values in R:
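In base R, acf() and pacf() from the stats package compute these values. A minimal sketch on a stand-in series (the differenced log of the built-in `AirPassengers` data, chosen only for illustration):

```r
x <- diff(log(AirPassengers))   # stand-in for a stationarized series

# plot = FALSE returns the values instead of drawing the correlogram
acf_vals  <- acf(x,  lag.max = 24, plot = FALSE)
pacf_vals <- pacf(x, lag.max = 24, plot = FALSE)

print(round(acf_vals$acf[1:5], 3))    # acf() output starts at lag 0
print(round(pacf_vals$acf[1:5], 3))   # pacf() output starts at lag 1
```

Leaving plot at its default of TRUE draws the charts with approximate significance bounds.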
Autoregressive (AR) Models
The PACF at all lags can be calculated by fitting a succession of AR models with increasing numbers of lags.
An autoregressive model of order ‘p’, AR(p), is as follows:
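In standard notation the model can be written as shown here. The constant c and the coefficient names a_1, ..., a_p are symbols assumed for this sketch, and e_t is the random error term.

```latex
X_t = c + a_1 X_{t-1} + a_2 X_{t-2} + \dots + a_p X_{t-p} + e_t
```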
This means the current value Xt can be calculated from its past values plus a random error et.
It is similar to the multiple regression model but Xt is regressed on its past values.
The AR(1) Model
A simple way to model dependence over time is with the autoregressive model of order 1, which is an OLS model of Xt regressed on lagged Xt-1.
Let us see what the model says for the t+1 observation.
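Using c for a constant and a_1 for the lag coefficient (both symbols are assumptions made for this sketch), the AR(1) model and its form for the next observation can be written as:

```latex
X_t = c + a_1 X_{t-1} + e_t
\qquad\Longrightarrow\qquad
X_{t+1} = c + a_1 X_t + e_{t+1}
```

At time t the value X_t is known, so only the new shock e_{t+1} is unknown.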
The AR(1) model expresses what we don’t know in terms of what we do know at the time t.
Identifying an AR Process
The autocorrelations of a pure AR(p) process decay gradually as the lag length increases. The partial autocorrelations, however, show a distinctive feature: they ‘die out’ after p lags.
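This pattern can be checked on simulated data. The sketch below draws an AR(2) series with arbitrary illustrative coefficients and compares its sample ACF and PACF.

```r
set.seed(11)
# AR(2) with illustrative coefficients 0.5 and 0.3
x <- arima.sim(model = list(ar = c(0.5, 0.3)), n = 2000)

acf_vals  <- acf(x,  lag.max = 10, plot = FALSE)$acf[-1, 1, 1]   # lags 1 to 10
pacf_vals <- pacf(x, lag.max = 10, plot = FALSE)$acf[, 1, 1]     # lags 1 to 10

print(round(acf_vals, 2))    # expected to shrink gradually
print(round(pacf_vals, 2))   # expected to be near zero beyond lag 2
```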
Moving-average (MA) Models
A moving-average model of order ‘q’, MA(q), is as follows:
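In standard notation, with b_1, ..., b_q as the coefficients on past shocks and mu as the series mean (mu is a symbol assumed for this sketch), the model can be written as:

```latex
X_t = \mu + e_t + b_1 e_{t-1} + b_2 e_{t-2} + \dots + b_q e_{t-q}
```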
We can calculate the current value of Xt using the past shocks/error (e) and adding a new shock/error (et).
The time series is regarded as a moving average (unevenly weighted, because of different coefficients) of a random shock series et.
The MA(1) Model
The first order moving-average model would look like:
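Setting q = 1 in the general form gives the following, where mu is an assumed symbol for the series mean:

```latex
X_t = \mu + e_t + b_1 e_{t-1}
```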
- If the value of b1 is zero, X depends purely on the error/shock (e) at the current time, and there is no temporal dependence.
- If the value of b1 is large, previous errors influence the value of Xt.
If the model successfully captures the dependence structure in the data, the residuals should look random.
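One common way to check whether residuals look random is the Ljung-Box test. The sketch below fits an MA(1) model to simulated data (the coefficient 0.6 is an arbitrary choice) and tests its residuals.

```r
set.seed(7)
x <- arima.sim(model = list(ma = 0.6), n = 500)   # simulated MA(1), b1 = 0.6

fit_ma1 <- arima(x, order = c(0, 0, 1))
print(coef(fit_ma1))   # estimates of b1 ("ma1") and the mean ("intercept")

# Ljung-Box test on the residuals; fitdf = 1 accounts for the one MA coefficient
lb <- Box.test(residuals(fit_ma1), lag = 10, type = "Ljung-Box", fitdf = 1)
print(lb$p.value)      # a large p-value is consistent with random-looking residuals
```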
Identifying an MA Process
The behavior of autocorrelations and partial autocorrelations for pure MA(q) processes is the reverse of that for pure AR(p) processes. The autocorrelations of a pure MA(q) process should ‘die out’ after q lags, while its partial autocorrelations decay slowly over time.
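The reversed pattern can be seen in a simulated MA(2) series. The coefficients 0.6 and 0.4 below are arbitrary illustrative values.

```r
set.seed(21)
# MA(2) with illustrative coefficients 0.6 and 0.4
x <- arima.sim(model = list(ma = c(0.6, 0.4)), n = 2000)

ma_acf  <- acf(x,  lag.max = 10, plot = FALSE)$acf[-1, 1, 1]   # lags 1 to 10
ma_pacf <- pacf(x, lag.max = 10, plot = FALSE)$acf[, 1, 1]     # lags 1 to 10

print(round(ma_acf, 2))    # expected to be near zero beyond lag 2
print(round(ma_pacf, 2))   # expected to fade out gradually
```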
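General Theoretical ACF and PACF of ARIMA Models
- AR(p): the ACF tails off gradually; the PACF cuts off after lag p.
- MA(q): the ACF cuts off after lag q; the PACF tails off gradually.
- ARMA(p,q): both the ACF and the PACF tail off gradually.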
6) Choose a Time Series Model
- ARMA models are suitable for time series where mean and variance are constant.
- The ARMA model consists of two parts, an autoregressive (AR) part and a moving average (MA) part.
- In the ARMA(p,q) model:
- p is the order of the autoregressive part.
- q is the order of the moving average part.
ARIMA, or Autoregressive Integrated Moving Average, is a generalization of the ARMA model that can be applied to non-stationary data to predict future points in the series. ARIMA models are especially useful for series with stochastic trends that first-order (‘simple’) differencing can remove.
The ‘I’ in ARIMA stands for integrated, which means we are differencing the series.
The typical notation for an ARIMA model is ARIMA(p, d, q), where:
- p is the number of autoregressive terms.
- d is the order of differencing.
- q is the number of moving average terms.
For instance, ARIMA(1,1,0) is a first-order autoregressive model with one order of differencing.
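As an illustration of the notation, the sketch below simulates an ARIMA(1,1,0) series (the AR coefficient 0.5 is an arbitrary choice) and fits the same order with arima() from base R.

```r
set.seed(123)
# ARIMA(1,1,0): AR(1) increments with coefficient 0.5, integrated once
x <- arima.sim(model = list(order = c(1, 1, 0), ar = 0.5), n = 500)

fit_110 <- arima(x, order = c(1, 1, 0))   # p = 1, d = 1, q = 0
print(coef(fit_110))                      # estimated AR coefficient ("ar1")
```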
How to use ARIMA(p,d,q)?
- Plot the data.
- Decompose the data into trend, seasonality and randomness.
- Check whether data is stationary or not.
- Plot the ACF and PACF charts (stationarity is implied by the ACF or PACF dropping quickly to zero).
- If there is non-stationarity, such as a trend (ignoring seasonal behavior), we need to apply differencing. In practice, at most two differences are needed to reduce a series to stationarity.
- Verify stationarity by augmented Dickey-Fuller test.
- After obtaining the stationary data, analyze the ACF and PACF for any remaining pattern. Also, compare them with the theoretical behavior of the MA and AR models to see if they fit. You might build an ARIMA model with no AR component or no MA component, i.e. ARIMA(0,d,q) or ARIMA(p,d,0).
Seasonal ARIMA Model
The seasonal ARIMA model is used when both trend and seasonality are present. It looks for the best explanatory variables to model a seasonal pattern.
ACF and PACF are used to identify seasonal components P or Q:
- For ARIMA(0,0,0)(P,0,0)s, you should see major peaks on the PACF at lags s, 2s, …, Ps. The coefficients on the ACF at lags s, 2s, …, Ps, … should form an exponential decrease or a damped sine wave.
- For ARIMA(0,0,0)(0,0,Q)s, you should see major peaks on the ACF at lags s, 2s, …, Qs. The coefficients on the PACF at lags s, 2s, …, Qs, … should form an exponential decrease or a damped sine wave.
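The first case can be illustrated with a simulated seasonal AR(1) process of period 12. As a workaround for this sketch, it is written as an AR(12) whose only non-zero coefficient sits at lag 12; the value 0.7 is arbitrary.

```r
set.seed(2024)
s <- 12
# Seasonal AR(1), i.e. ARIMA(0,0,0)(1,0,0) with period 12
x <- arima.sim(model = list(ar = c(rep(0, s - 1), 0.7)), n = 2400)

s_pacf <- pacf(x, lag.max = 3 * s, plot = FALSE)$acf[, 1, 1]
s_acf  <- acf(x,  lag.max = 3 * s, plot = FALSE)$acf[-1, 1, 1]   # drop lag 0

print(round(s_pacf[c(s, 2 * s, 3 * s)], 2))   # a peak is expected at lag 12 only
print(round(s_acf[c(s, 2 * s, 3 * s)], 2))    # expected to decrease across seasons
```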
7) Build the ARIMA Model Using Different Combinations of ACF and PACF Values
Here is the R code:
fit <- arima(T_Data_box, order = c(4, 1, 1), seasonal = list(order = c(1, 1, 1), period = 12))
Here is the R output:
Pick the model with the lowest AIC value. AIC stands for Akaike Information Criterion, a measure used to compare candidate models: the lower the AIC, the better the model.
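A minimal sketch of such a comparison follows. The shortlist of orders is hypothetical, the seasonal part is held fixed for simplicity, and log(AirPassengers) stands in for the transformed series.

```r
y <- log(AirPassengers)   # log transform, i.e. Box-Cox with lambda = 0

# Hypothetical shortlist of non-seasonal (p, d, q) orders to compare
candidates <- list(c(0, 1, 1), c(1, 1, 0), c(1, 1, 1))

aic_values <- sapply(candidates, function(ord) {
  fit <- arima(y, order = ord,
               seasonal = list(order = c(0, 1, 1), period = 12))
  AIC(fit)
})
names(aic_values) <- sapply(candidates, paste, collapse = ",")

print(aic_values)
best <- names(which.min(aic_values))   # order with the smallest AIC
print(best)
```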
8) Forecast the Data as per the Above Model
Here is the R code to forecast data for the next 12 months:
pred_fit <- predict(fit, n.ahead = 1*12)
9) Merge Actual and Forecast in One Series
comb <- ts.union(T_Data_box, pred_fit$pred)
final_t <- pmin(comb[,1], comb[,2], na.rm = TRUE)
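Steps 8 and 9 can be tried end to end on a stand-in series. The model order below is an illustrative choice for log(AirPassengers), not the one fitted in step 7, and the merge uses ifelse() as an alternative to pmin().

```r
y <- log(AirPassengers)
fit_demo <- arima(y, order = c(0, 1, 1),
                  seasonal = list(order = c(0, 1, 1), period = 12))

# predict() returns a list with $pred (forecasts) and $se (standard errors)
fc <- predict(fit_demo, n.ahead = 12)

# ts.union() aligns both series on one time axis; each column is NA
# wherever the other one has data
comb_demo <- ts.union(y, fc$pred)
merged <- ts(as.numeric(ifelse(is.na(comb_demo[, 1]),
                               comb_demo[, 2], comb_demo[, 1])),
             start = start(comb_demo), frequency = frequency(comb_demo))

# Last six actual values followed by the first six forecasts
print(window(merged, start = c(1960, 7), end = c(1961, 6)))
```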
10) Remove Transformation from Series
final_forecast <- exp(log(lambda * final_t + 1) / lambda)
Now you have a data series with the actual numbers and the forecast for the next 12 months.
A key takeaway from these steps is a clear understanding of ACF and PACF, which are the deciding factor in choosing the forecasting technique to apply to a dataset. The steps also help ensure that the time series is stationary before a forecasting technique is run on it.
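The back-transformation in step 10 can be checked with a small round trip. The helper functions, the sample values, and lambda = 0.3 below are all illustrative assumptions; the exp/log form applies when lambda is non-zero.

```r
# Illustrative Box-Cox helpers (not the forecast package implementation)
box_cox <- function(x, lambda) {
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}
inv_box_cox <- function(z, lambda) {
  if (abs(lambda) < 1e-8) exp(z) else (lambda * z + 1)^(1 / lambda)
}

x <- c(112, 118, 132, 129, 121)   # arbitrary positive values
demo_lambda <- 0.3                # arbitrary non-zero lambda

z <- box_cox(x, demo_lambda)
back_a <- exp(log(demo_lambda * z + 1) / demo_lambda)   # formula from step 10
back_b <- inv_box_cox(z, demo_lambda)                   # equivalent power form

print(all.equal(back_a, x))
print(all.equal(back_b, x))
```

When lambda is zero the transform is a plain log, so the inverse is exp().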
We hope this blog is helpful and that you will be able to build a time series model for your own requirements. You may want to take on different kinds of forecasting problems and apply the techniques above to gain a better understanding.
Let us know in the comments below how helpful this blog was in building your time series model, and share the results of your forecasting.
That is all from us.
This is also the end of the series. It was a great experience bringing you this information and sharing and gaining knowledge.
Did you miss the first part? – https://techblog.xavient.com/introduction-to-time-series-forecasting-in-r/
Until next time!