Overview: Databox's Forecasting Methodology

Databox offers a forecasting feature that lets you predict future results for key metrics. The service, developed by Databox's Data Science team, uses Facebook's Prophet model, which derives forecasts from historical data, including its trend and seasonal patterns.

Because Databox's metric-related data is time-based, forecasting involves predicting how the sequence of observations will evolve in the future. Effective forecasting is critical for organizations engaging in capacity planning and goal setting, as it allows for efficient resource allocation and performance measurement against baseline targets.

Factors That Impact Forecasting Accuracy

Several factors can affect the accuracy of Databox's forecasting model. The most important are:

Historical Data: Forecasts are predictions based on past data. The more historical data that is available, the better we can understand trends and patterns and use that to forecast future performance.
Data Quality: The accuracy of the forecast is highly dependent on the quality of the historical data. If data is incomplete, inconsistent, or contains errors, the accuracy of the forecast will be impacted.
Seasonality: Many metrics are subject to seasonal fluctuations, such as lower sales during the holiday season or higher sales at the end of a quarter. The model can account for weekly and yearly seasonality as long as these seasonal trends can be recognized through the historical data.
Data Anomalies: Unusual spikes or dips can affect the accuracy of the forecast. Anomalies could be caused by one-time events, like a server outage leading to a dip in website visits or a promotional event leading to a spike in sales.

The Forecasting Model

Prophet

Databox's Forecasting tool uses Prophet, a decomposable time series regression model developed by Facebook's Core Data Science team. The model has three components: trend, seasonality, and holidays. These components are combined additively, as described in the following equation:

y(t) = g(t) + s(t) + h(t) + ε(t)

where ε(t) is an error term covering changes that the model does not capture, and:

g(t) is a piecewise linear or logistic growth curve trend. Prophet automatically detects changes in trends by selecting change points from the data.
s(t) is a combination of:
- A yearly seasonal component modeled using a Fourier series. It can be auto-detected if the appropriate setting is enabled.
- A weekly seasonal component modeled using dummy variables.
h(t) is a user-provided list of important holidays.

Pro Tip: User-provided lists of important holidays, represented by h(t) in the equation, are not currently included in Databox's forecasting model.
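To make the additive structure concrete, here is a minimal sketch in plain Python. It is an illustration only, not Databox's or Prophet's implementation: the function names, the weekday effects, and every coefficient are hypothetical values chosen for the example. Time t is an integer day index, and h(t) is fixed at zero to reflect the Pro Tip above.

```python
import math

def trend(t, slope=0.5, intercept=100.0):
    """g(t): a single linear segment (hypothetical slope and intercept)."""
    return intercept + slope * t

# Hypothetical weekly effects: one dummy-variable coefficient per weekday.
WEEKDAY_EFFECT = [2.0, 1.0, 0.5, 0.0, -0.5, -1.5, -1.5]

def weekly(t):
    """Weekly part of s(t): look up the coefficient for the day of the week."""
    return WEEKDAY_EFFECT[t % 7]

def yearly(t, amplitude=5.0, period=365.25):
    """Yearly part of s(t): a single first-order Fourier term."""
    return amplitude * math.sin(2 * math.pi * t / period)

def holidays(t):
    """h(t): always zero here, since holiday lists are not used."""
    return 0.0

def forecast(t):
    """Additive combination of the components."""
    return trend(t) + weekly(t) + yearly(t) + holidays(t)
```

Because the components are simply added, each one can be inspected or plotted on its own, which is what makes the model decomposable.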

The key concept in Prophet is to accurately fit the trend component by using a flexible regression model to help improve the precision and accuracy of the forecast. This approach allows for greater modeling flexibility, easier model fitting, and better handling of missing data or outliers than traditional time series models.

With Prophet, we can model nonlinear saturating growth by specifying the minimum and maximum values the metric can reach. For forecasting problems that don't exhibit saturating growth, we can use a piecewise constant rate of growth, which provides a parsimonious model.
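The difference between the two growth types can be sketched as follows. This is a simplified illustration under stated assumptions: the parameter names (cap, floor, k, m) and the way the floor is applied are choices made for this example, and the changepoints that a full trend model would include are left out.

```python
import math

def logistic_trend(t, cap, k, m, floor=0.0):
    """Saturating growth: rises from `floor` toward `cap`.

    k is the growth rate and m is the time of the curve's midpoint.
    """
    return floor + (cap - floor) / (1.0 + math.exp(-k * (t - m)))

def linear_trend(t, k, m):
    """Non-saturating growth: constant rate k with offset m."""
    return k * t + m
```

For example, logistic_trend(50, cap=1000, k=0.1, m=50) sits exactly halfway between the floor and the cap, while linear_trend keeps growing at the same rate no matter how large t becomes.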

Prophet's ability to perform well on a diverse range of forecasting problems and its efficient prediction generation process make it an ideal choice for large-scale forecasting projects.

Although Databox is using the Prophet model developed by Facebook's team, the Metric Forecast feature in Databox is independently developed and run. All of the data needed for forecasting is processed by Databox and is not sent or shared with any third party.

How Does Databox's Forecasting Model Work?

Data Collection

To generate forecasts, our service uses a metric data configuration ID to retrieve data needed for forecasting. The source of this data depends on the storage solution used. We obtain the required data directly from the current warehouse. As a result, our model-building and benchmarking processes use the same database as our production systems, which includes Databox's Analytics app and Benchmarks app.

Data Preparation

To build a time series forecasting model, we need to have a variable that we want to forecast, indexed by a specific date. Before fitting the model, the input data is adjusted to the format required by the model fitting function. This results in superior performance compared to using the data as is.
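As an illustration of this reshaping step: Prophet's fitting function expects two columns, a date column named ds and a value column named y. The sketch below converts raw records into that shape using only the standard library. The input field names ("date", "value") and the helper name are hypothetical, and Databox's actual preparation pipeline is not described in this article.

```python
from datetime import date

def to_prophet_rows(records):
    """Convert raw metric records into date-sorted rows with `ds` and `y` keys.

    `records` is assumed to be a list of dicts with an ISO date string under
    "date" and a numeric reading under "value" (hypothetical field names).
    """
    rows = [
        {"ds": date.fromisoformat(r["date"]), "y": float(r["value"])}
        for r in records
    ]
    rows.sort(key=lambda row: row["ds"])  # the series must be in time order
    return rows

raw = [
    {"date": "2024-01-02", "value": "17"},
    {"date": "2024-01-01", "value": 12},
]
rows = to_prophet_rows(raw)
```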

The most important aspect of this is the handling of anomalies. Because anomalies are, by definition, not predictable, they can bias forecasts and inflate the estimated variance. Left unhandled, they make the prediction intervals wider than they would be otherwise.
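The article does not say how anomalies are detected, so the following is only one possible approach, shown as a sketch: flag points whose modified z-score (based on the median absolute deviation) exceeds a threshold and replace them with missing values, which the model can tolerate. The function name and the threshold of 3.5 are assumptions made for this example.

```python
import statistics

def mask_anomalies(values, threshold=3.5):
    """Replace extreme points with None using a MAD-based modified z-score."""
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        return list(values)  # no spread to measure against; leave data as is
    cleaned = []
    for v in values:
        score = 0.6745 * abs(v - median) / mad
        cleaned.append(None if score > threshold else v)
    return cleaned

# Hypothetical daily website visits with a one-day outage on day 5.
visits = [100, 102, 98, 101, 5, 99, 103]
cleaned = mask_anomalies(visits)
```

Masking rather than deleting keeps the dates aligned, so the remaining points still sit on the right days of the week.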

Trend Modelling

Our forecasting model uses a piecewise linear approach to capture changes in the underlying trend of the time series. This method involves analyzing how the data changes over time and identifying points where the trend shifts. The model then fits a linear regression to each segment, resulting in a more precise forecast.
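A piecewise linear trend of this kind can be sketched in a few lines. This is an illustration, not the production model: the changepoints and slope changes are passed in by hand here, whereas the real model estimates them from the data. The offset adjustment at each changepoint is what keeps the segments joined without a jump.

```python
def piecewise_linear_trend(t, k, m, changepoints, deltas):
    """Evaluate a continuous piecewise linear trend at time t.

    k: base slope, m: base offset,
    changepoints: sorted times at which the slope changes,
    deltas: the slope change applied at each changepoint.
    """
    slope, offset = k, m
    for s, delta in zip(changepoints, deltas):
        if t >= s:
            slope += delta
            offset -= s * delta  # keeps the line continuous at s
    return slope * t + offset
```

With k=1, m=0, a single changepoint at t=10, and a slope change of -0.5, the trend rises by 1 per step up to t=10 and by 0.5 per step afterwards.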

Seasonality Modelling

The model uses a Fourier series to capture seasonal patterns in the data. This involves identifying the main seasonal periods in the data (e.g., weekly, monthly, yearly) and fitting a sum of sinusoidal functions to each of them. By doing this, the model makes more accurate predictions about how the data will change over time.
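The sinusoidal functions mentioned above can be written out as follows. This is a minimal sketch: the helper names and the example coefficients are hypothetical, and in a fitted model the coefficients would be estimated from the historical data.

```python
import math

def fourier_features(t, period, order):
    """Return [sin, cos] pairs for harmonics 1..order of the given period."""
    features = []
    for n in range(1, order + 1):
        angle = 2 * math.pi * n * t / period
        features.extend([math.sin(angle), math.cos(angle)])
    return features

def seasonal_value(t, period, coefficients):
    """Seasonal effect at time t: weighted sum of the Fourier features."""
    order = len(coefficients) // 2
    features = fourier_features(t, period, order)
    return sum(c * f for c, f in zip(coefficients, features))

# Hypothetical coefficients for a second-order weekly pattern.
coefficients = [1.5, -0.3, 0.4, 0.1]
```

A higher order lets the seasonal curve follow sharper patterns, at the cost of a greater risk of fitting noise.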

In the case of daily granularity, weekly seasonality is enabled if there are at least two weeks' worth of data points. Yearly seasonality is enabled when the time series covers at least the range shown below for its granularity:

Minimum time series range required for yearly seasonality, by granularity

Time Series Granularity	Minimum time series range (in days)
Daily	729
Weekly	722
Monthly	699
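The rules above can be expressed as a small lookup. This is a sketch with stated assumptions: the function name is hypothetical, "two weeks' worth of data points" is read as 14 daily points, and the minimum ranges are treated as inclusive.

```python
# Minimum time series range (in days) for yearly seasonality, from the table.
YEARLY_MIN_RANGE_DAYS = {"daily": 729, "weekly": 722, "monthly": 699}

def enabled_seasonalities(granularity, range_days, n_points):
    """Return which seasonal components would be switched on.

    granularity: "daily", "weekly", or "monthly"
    range_days: days between the first and last observation
    n_points: number of observations in the series
    """
    yearly = range_days >= YEARLY_MIN_RANGE_DAYS[granularity]
    # Assumption: weekly seasonality applies to daily data with 14+ points.
    weekly = granularity == "daily" and n_points >= 14
    return {"weekly": weekly, "yearly": yearly}
```

For instance, a daily series with 30 points covering 29 days would get weekly seasonality but not yearly seasonality.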

Forecast

Once the model is trained, it can be used to forecast future values of the time series. The model generates a point forecast for each future time point, as well as upper and lower bounds for the uncertainty intervals. This allows users to see how the data is likely to change over time and how certain the model is about those predictions.

The model is designed to generate uncertainty intervals around its point forecasts. It does this using a Bayesian framework, which is a mathematical technique that allows the model to estimate how certain it is about its predictions. The uncertainty intervals give users an idea of how much variation there might be in the data and how confident the model is about its forecasts.

The main forecast line is the point forecast generated by the time series forecasting model, which provides the best estimate of future values based on historical data. The upper and lower forecasted values correspond to the confidence interval around the point forecast and are set at an 80% confidence level. The confidence interval represents the range within which the true future value is expected to fall with an 80% probability.

The width of the interval varies with the forecast horizon. As the horizon increases, the uncertainty around the forecast may increase, resulting in wider intervals.
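One way to see why an 80% interval widens with the horizon is to simulate many possible futures and take the 10th and 90th percentiles at each step. The sketch below is a simplified stand-in for the model's Bayesian machinery, not a reproduction of it: the function name, the noise levels, and the random-walk treatment of trend uncertainty are all assumptions made for illustration.

```python
import random
import statistics

def simulate_interval(point_forecast, n_paths=1000, trend_sd=0.5,
                      noise_sd=1.0, seed=0):
    """Return (lower, upper) 80% bounds for each step of the forecast."""
    rng = random.Random(seed)
    paths = []
    for _ in range(n_paths):
        deviation = 0.0
        path = []
        for value in point_forecast:
            deviation += rng.gauss(0.0, trend_sd)  # trend uncertainty accumulates
            path.append(value + deviation + rng.gauss(0.0, noise_sd))
        paths.append(path)
    lower, upper = [], []
    for step in range(len(point_forecast)):
        deciles = statistics.quantiles([p[step] for p in paths], n=10)
        lower.append(deciles[0])    # 10th percentile
        upper.append(deciles[-1])   # 90th percentile
    return lower, upper

# Hypothetical flat point forecast over a 30-step horizon.
point = [100.0] * 30
lower, upper = simulate_interval(point)
```

Comparing upper[0] - lower[0] with upper[-1] - lower[-1] shows the interval growing as trend uncertainty accumulates over the horizon.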