CRDSS_Task10-2_EvaluateExtensionHistoricalData


correlated with the dependent series) is chosen by one of several techniques usually provided. One regression equation is used to fill in values along the time series as long as all of the independent series are concurrent. When a gap in one of these series is reached, the equation must be reevaluated to include a different series with a concurrent record at that time step.

A brief review of available statistical packages showed that multiple regression is the most common data filling and extension technique employed. Regression, and on a smaller scale data filling and extension, is just one of numerous statistical computation and analysis tools provided in these packages. Most include, at the very least, graphing capabilities, hundreds of statistical computations, and numerous analysis tools and techniques. The packages reviewed included Statistica, SPSS 8.0, StatView v4.5, DataDesk, InStat, Statgraphics, S-Plus, Minitab, NCSS, and SimStat. Generally, the capabilities of these packages far exceed the needs of the CRDSS project, although they could reasonably be employed as part of the system.

A disadvantage of using regression for data filling and extension is its tendency to reduce the variance of the new data. There are ways to combat this, including adding noise to the model, or using variance maintenance as the constraint in the regression model instead of minimum error. The latter method is referred to as Maintenance of Variance (MOVE). While both techniques succeed in maintaining the variance of the original data, each has drawbacks. Adding noise limits the reproducibility of the new data because it introduces a stochastic component into the model, and MOVE has been shown in some cases to overestimate the variance and to inflate the interstation correlation with the independent gage.
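The variance reduction described above can be illustrated with a short sketch. It is not part of the original study: the data are synthetic, the function names (fit_ols, fit_move1, fill_gaps) are hypothetical, and the MOVE line shown is assumed to be the simplest single-predictor variant, commonly called MOVE.1, in which the slope is the ratio of the standard deviations rather than the least-squares slope. Under those assumptions, the sketch fills the missing part of a dependent record from one independent gage and prints the variance of the estimates from each method next to the variance of the observed record.

```python
import numpy as np

def fit_ols(x, y):
    """Least-squares line y = a + b*x fitted to the concurrent record."""
    b = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
    a = y.mean() - b * x.mean()
    return a, b

def fit_move1(x, y):
    """MOVE.1-style line: slope = sign(r) * s_y / s_x, so estimates made
    over the concurrent period reproduce the sample variance of y."""
    r = np.corrcoef(x, y)[0, 1]
    b = np.sign(r) * y.std(ddof=1) / x.std(ddof=1)
    a = y.mean() - b * x.mean()
    return a, b

def fill_gaps(y, x, a, b):
    """Return a copy of y with NaN entries replaced by a + b*x."""
    filled = y.copy()
    gaps = np.isnan(y)
    filled[gaps] = a + b * x[gaps]
    return filled

# Synthetic example (assumed values, for illustration only).
rng = np.random.default_rng(0)
x = rng.normal(10.0, 3.0, size=200)              # independent gage, complete
y = 2.0 + 0.8 * x + rng.normal(0.0, 2.0, 200)    # dependent gage
y[120:] = np.nan                                 # dependent record ends early

obs = ~np.isnan(y)
print("observed variance:", np.var(y[obs], ddof=1))
for name, fit in (("OLS", fit_ols), ("MOVE.1", fit_move1)):
    a, b = fit(x[obs], y[obs])
    filled = fill_gaps(y, x, a, b)
    print(name, "variance of estimates:", np.var(filled[~obs], ddof=1))
```

Over the concurrent period, the least-squares estimates have variance equal to r squared times the variance of the observed series, where r is the correlation between the two gages, so the reduction grows as the correlation weakens. The MOVE.1 estimates match the observed variance over that same period by construction.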
Should regression techniques be used in data extension and filling, it is suggested that a statistical comparison of the original and new data series be performed to ensure satisfactory maintenance of the statistics. The preliminary review of statistical packages for this study did not go into sufficient depth to determine whether these techniques were available in most of the packages.

The second method uses a multivariate approach to data filling, where multivariate refers to several stations or gages rather than to more than one parameter (Salas et al. 1994). The multivariate model takes into account the cross correlations between a set of series with missing values and a set of series with complete records. The missing values at the dependent stations at time t+1 are described by a linear combination of the values at those stations at time t and the values at the independent stations at times t and t+1. The equation takes the following form.

$$Y_{t+1} = \tilde{X} A + \varepsilon B, \qquad \tilde{X} = [\, Y_t \;\; X_t \;\; X_{t+1} \,]$$

The linear parameter matrices A and B are estimated using standard time series estimation procedures, and $\varepsilon$ is a stochastic component introduced to maintain variance. Designating p as the number of dependent gages and m as the number of independent gages, Y will be a 1 x p matrix, A will be a (p+2m) x p matrix, and B will be a p x p matrix; for the products to conform, $\tilde{X}$ is 1 x (p+2m) and $\varepsilon$ is 1 x p.

The multivariate method is advantageous in that several short or incomplete records can be filled at once. All of the gages (independent and dependent) are used to estimate missing values, taking into account lag-1 autocorrelations in the incomplete or short records. The disadvantages are that all periods