Draft: DISCRIMINATION BETWEEN A GROUP OF THREE-PARAMETER DISTRIBUTIONS FOR HYDRO-METEOROLOGICAL FREQUENCY MODELING

We recommend methods of discrimination between some three-parameter distributions used in hydro-meteorological frequency modeling. Discriminations are between model pairs belonging to the group {generalized extreme value (GEV), Pearson Type III (P3), generalized logistic (GLO)}. To assess the fit of these distributions to data, the Akaike information criterion (AIC), Bayesian information criterion (BIC), and/or goodness-of-fit measures are commonly employed. However, it is difficult to estimate the discrimination power and bias of these methods when used with three-parameter distributions. Consequently, we propose two alternative tools and assess their performance. Both tools are based on a sample transformation to normality followed by applying a powerful statistic for testing normality, such as the Shapiro-Wilk or the probability plot correlation coefficient statistic. While arriving at recommendations for discriminating between the (GEV, GLO) and (P3, GLO) pairs of models, we show that the discriminati...


Introduction
Hydro-meteorological frequency analysis is concerned with analyzing the magnitudes of hydro-meteorological events and assessing their probability of occurrence, for use in risk assessment and management. These events can be floods, droughts, extreme rainfalls, or other extremes. In hydrology, the identification of a statistical distribution to model the frequency of occurrence of extreme events, and an efficient estimation of the distribution's parameters, are important. At present, a need exists for improved methods of selecting an appropriate statistical distribution to fit hydro-meteorological data. Such improved methods help reduce the error involved in quantile estimation. When it is reasonable to assume the data series to be independent and identically distributed (iid), hydrologists often fit several candidate distributions to the data and aim to choose the model with the best possible fit. The selection of a distribution is generally done on the basis of goodness-of-fit tests, in conjunction with graphical methods such as probability plots, but a fair degree of subjectivity still exists in the process of distribution selection. L-moment ratio diagrams (Hosking and Wallis 1997) are graphical methods used to explore the suitability of different statistical distributions, but the subjectivity involved in these methods limits their suitability for discriminating between candidate distributions.
After carefully assessing a group of candidate distributions, hydrologists often end up having to discriminate between a pair of models that appear to fit the data well, but only one of them is to be selected for the application at hand. Goodness-of-fit tests are generally used as a basis for rejecting some distributions, but not for selecting the best distribution. This is why applying goodness-of-fit tests to two distinct models quite often leads to the rejection of neither of them, yet the user still needs to make a unique model choice in an objective manner. Choosing the best model becomes especially important when estimating events in the distribution's tail(s), where the effect of the model assumption is critical. This makes it important to look for the most powerful, and preferably most practical, method(s) to discriminate between competing models. Two-parameter probability distributions such as the generalized Pareto, lognormal, gamma or Weibull are useful in fitting datasets in areas such as peaks-over-threshold hydro-meteorological extreme value modeling. Three-parameter distributions are also very important, such as in fitting annual maximum flood or precipitation series. These distributions include the generalized extreme value (GEV), Pearson Type III (P3), Log-Pearson Type III (LP3) and generalized logistic (GLO) distributions. Different countries have adopted some of these distributions for flood or precipitation frequency analysis. The United States Water Resources Council (1977) recommends the LP3 distribution for flood frequency modeling, as is the case in Australia (Sangal and Kallio 1977; McMahon and Srikanthan 1981). China recommends the P3 distribution (Chen et al. 2010, 2012) and the British Natural Environment Research Council (NERC 1975) recommends the GEV for flood and rainfall analysis. The GLO distribution is recommended in the United Kingdom as the default flood frequency distribution (Robson and Reed 1999).
The present study focuses on a group of three-parameter distributions important in hydro-meteorological frequency modeling. It has the objective of recommending discrimination methods between these models. We will propose some discrimination procedures, justify their selection, and then test and compare them. We will base our comparisons on the procedures' "discrimination power" and "discrimination bias", which we will formally define in the next section. Goodness-of-fit measures based on the empirical cumulative distribution function, as well as the probability plot correlation coefficient statistic (Filliben 1975; Vogel 1986, and others), are commonly used to assess model fit. The Akaike information criterion and the Bayesian information criterion are also widely used to assess the fit of a hypothesized distribution to data.
It is noted that past studies have largely focused on discriminating between two-parameter models, which are important in hydrology, as already mentioned. However, there is a clear need to pay more attention to discrimination between distributions with more than two parameters.

The ratio of maximized likelihood (RML) statistic is the most widely investigated method for choosing between frequency models. Early statistical studies used this statistic for discrimination (see, e.g., Cox 1961, 1962; Jackson 1968). However, the large-sample basis of this discrimination statistic left some doubts with regard to its applicability to small- or moderate-sized samples. As a result, several subsequent studies (e.g., Dyer 1973; Dumonceaux & Antle 1973; Bain & Engelhardt 1980; Kappenman 1982; Kundu & Manglick 2004; Strupczewski et al. 2006; Aucoin & Ashkar 2010) resorted to Monte Carlo simulations to test the performance of the RML statistic with both large and small sample sizes. It is important to note that when the discrimination is between models with the same number of unknown parameters, such as in the present study, the RML statistic, the Akaike information criterion and the Bayesian information criterion become equivalent (as noted, for example, by Ashkar and Ba 2017). Other procedures for choosing between frequency models use goodness-of-fit measures that are based on the empirical cumulative distribution function given in Eq. (1). These include the Kolmogorov-Smirnov, the Cramér-von Mises and the Anderson-Darling procedures mentioned in the previous section. In applying these procedures, many users choose the model with the lowest goodness-of-fit measure, although this is not necessarily the best way to apply these procedures. We will add a remark concerning this in section 3.
Some goodness-of-fit statistics aim at assessing the adequacy of one specific type of frequency model to fit a dataset. For example, the Shapiro-Wilk (SW) statistic is widely used to assess the fit of the normal distribution. Ashkar et al. (1997) proposed a modification to this statistic in order to allow for discrimination between a pair of non-normal distributions. This modification calls for transforming the sample to approximate normality and then applying the Shapiro-Wilk statistic to the transformed sample. We will denote this procedure by "TN.SW". The probability plot correlation coefficient (PPCC) statistic, originally developed to test normality, has likewise been applied to other distributions (Chowdhury et al. 1991; Vogel & McMartin 1991; Heo et al. 2008; Kim & Heo 2010). In the present study, we will investigate this discrimination statistic both in its original form and in a modified form similar to the one just mentioned for the Shapiro-Wilk (SW) statistic. This modified form calls for transforming the sample to approximate normality, then applying the PPCC statistic to the transformed sample. We will denote this procedure by "TN.PPCC".

Discriminating between three-parameter models
The focus of this study is to propose and compare practical methods of discrimination between three-parameter distributions such as GEV, P3, and GLO, which are important in hydro-meteorological frequency modeling. In discriminating between a pair of models (M1, M2), we will focus on those discrimination statistics that do not cause numerical difficulties in fitting either model. The most important potential numerical difficulty lies in the non-convergence of the method used to fit model(s) M1 and/or M2. By Monte Carlo simulation, we will assess the performance of the discrimination statistics by analyzing their probabilities of correct selection (PCS) between the models M1 and M2. Based on such analysis, we will attempt to identify the discrimination statistics that possess the highest discrimination power between M1 and M2 and the lowest discrimination bias, where power and bias will be measured as follows:

• Let PCS_M1 denote the PCS when M1 is the assumed true model, and let PCS_M2 be defined in an analogous manner. Details on how to calculate PCS_M1 and PCS_M2 will be presented in section 5.

• Calculate PCS.mean as follows, and use it as a measure of discrimination power:

PCS.mean = (PCS_M1 + PCS_M2) / 2    (3)

• Calculate PCS.abs.diff as follows, and use it as a measure of discrimination (absolute) bias:

PCS.abs.diff = |PCS_M1 − PCS_M2|    (4)

A necessary first step in discriminating between models is to estimate their parameters from the data. The maximum likelihood estimation method is often favored based on its desirable properties of being consistent, asymptotically normal, and asymptotically efficient under the classical regularity conditions (Nagatsuka et al. 2014). Another attractive property of the maximum likelihood method is that it is invariant to reparameterization. However, for three-parameter distributions such as P3, the maximum likelihood method is often eliminated because it leads to problems such as the non-existence of estimates in some ranges of the parameters, large variability, and convergence problems pointed out by many authors (e.g., Bobée and Ashkar 1991; Nagatsuka et al. 2014). Moreover, in Monte Carlo simulations such as those to be employed in the present study, when the generated and the fitted distributions are not of the same type (for example, generated = GEV, fitted = GLO), the maximum likelihood method should be eliminated altogether because it consistently leads to non-convergence. For these reasons, we will not use maximum likelihood as the fitting method but will replace it by the method of probability weighted moments, which we found not to suffer from the aforementioned drawbacks exhibited by the maximum likelihood method.
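The two measures in Eqs. (3) and (4) are trivial to compute once PCS_M1 and PCS_M2 are available; a minimal sketch (the function name and the illustrative values are ours, not taken from the paper's tables):

```python
def pcs_power_and_bias(pcs_m1, pcs_m2):
    """Discrimination power (Eq. 3) and absolute bias (Eq. 4) from the two
    probabilities of correct selection, given as percentages or fractions."""
    pcs_mean = (pcs_m1 + pcs_m2) / 2.0   # discrimination power, Eq. (3)
    pcs_abs_diff = abs(pcs_m1 - pcs_m2)  # discrimination (absolute) bias, Eq. (4)
    return pcs_mean, pcs_abs_diff

# Example with PCS_M1 = 90% and PCS_M2 = 68%:
power, bias = pcs_power_and_bias(90, 68)
# power -> 79.0, bias -> 22
```

A method with high power but large |PCS_M1 − PCS_M2| systematically favors one of the two competing models, which is exactly what the bias measure is meant to expose.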

The discrimination statistics to be tested
As mentioned earlier, the ratio of maximized likelihood (RML) statistic is widely used to choose between a pair of competing models. In previous RML studies, which focused largely on discriminations between two-parameter distributions, the RML method did not pose any numerical problems. It also performed relatively well, although it was not necessarily the best method compared to others; see, e.g., Ashkar and Aucoin (2012a, 2012b) and Ashkar and Ba (2017).
However, attempting to use RML to discriminate between three-parameter distributions led to some serious numerical problems. The reason is that this method calls for using maximum likelihood for parameter estimation, and as mentioned earlier, applying maximum likelihood to three-parameter models led to convergence problems that necessitated replacing it by the probability weighted moments method. These problems occurred when the Monte-Carlo-generated distribution and the fitted distribution were not of the same type. For this reason, the RML method, and consequently the Akaike information criterion and the Bayesian information criterion, will not be pursued further in the present study, due to difficulties in calculating their discrimination power and bias by Monte Carlo simulation.
In applying goodness-of-fit measures based on the empirical cumulative distribution function (e.g., Kolmogorov-Smirnov, Cramér-von Mises or Anderson-Darling) to select between pairs of models (M1, M2), many users simply choose the model that provides the lowest goodness-of-fit measure (e.g., the lowest value of the Anderson-Darling statistic). However, it can be shown that applying these procedures in this manner could lead to a large discrimination bias, by favoring one competing model over the other (Ashkar and Aucoin 2012a, 2012b). To partially correct for this bias, a "p-value-based" approach for choosing between M1 and M2 could be proposed (Ashkar and Aucoin 2012a). However, this approach demands computer programming that limits its appeal in practice; for this practical reason, we will not pursue methods based on the empirical cumulative distribution function (e.g., those that use the Anderson-Darling statistic), although they may be interesting to pursue in future research.

The probability plot correlation coefficient (PPCC) goodness-of-fit statistic is another statistic used in model testing. As already mentioned, Filliben (1975) developed this statistic to test normality, but later studies expanded its use to test other models. This statistic is based on the correlation R between the ordered observations X_(i) and the corresponding fitted quantiles W_i = F̂⁻¹(p_i), where p_i is a plotting position of X_(i). The plotting position formula used in the present study was the Hazen formula: p_i = (i − 0.5)/n. The statistic R is given by:

R = Σ_{i=1}^{n} (X_(i) − X̄)(W_i − W̄) / [Σ_{i=1}^{n} (X_(i) − X̄)² Σ_{i=1}^{n} (W_i − W̄)²]^{1/2}    (5)

where X̄ and W̄ are the means of {X_(i)} and {W_i}, i = 1, …, n, respectively. Based on the foregoing discussion, the three discrimination statistics that we will focus on in the remainder of this study are TN.SW, PPCC, and TN.PPCC.
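The R statistic of Eq. (5), with Hazen plotting positions, can be sketched as follows. For a self-contained illustration we use a normal hypothesis (Filliben's original setting); in the study itself, the quantile function would come from a fitted GEV, P3 or GLO. The seed and parameter values are arbitrary:

```python
import numpy as np
from scipy import stats

def ppcc(x, ppf):
    """Probability plot correlation coefficient R of Eq. (5): correlation
    between the ordered sample and fitted quantiles at Hazen positions."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    p = (np.arange(1, n + 1) - 0.5) / n  # Hazen formula p_i = (i - 0.5)/n
    w = ppf(p)                           # fitted quantiles W_i = F^{-1}(p_i)
    return np.corrcoef(x, w)[0, 1]

# Illustration: sample from N(10, 2) tested against the same normal model;
# R should be close to 1 when the hypothesized model fits well.
rng = np.random.default_rng(1)
x = rng.normal(10.0, 2.0, size=50)
r = ppcc(x, stats.norm(10.0, 2.0).ppf)
```

To discriminate between two fitted models, one computes R under each and selects the model with the larger value, as described below.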

The distributions to consider
The discriminations that we will consider are between model pairs belonging to the group {GEV, P3, GLO}. Numerical problems encountered with the LP3 distribution prohibited it from being included in the analysis, as we will explain later in this section.
Hereafter, we present the probability density function, cumulative distribution function and quantile function for the GEV, GLO and P3 distributions. Appendix A provides additional information on these models, such as moment and probability-weighted-moments parameter estimators.

The cumulative distribution function, probability density function and quantile function of the GEV distribution are respectively given by:

F(x) = exp{−[1 − k(x − c)/b]^(1/k)},  k ≠ 0    (6)

f(x) = (1/b) [1 − k(x − c)/b]^(1/k − 1) exp{−[1 − k(x − c)/b]^(1/k)}

x(F) = c + (b/k) [1 − (−ln F)^k]

where b, c, and k are scale, location, and shape parameters, respectively; and where the support (or sample space) is:

x ≤ c + b/k for k > 0;  x ≥ c + b/k for k < 0    (7)

For the GLO distribution, we have:

F(x) = 1 / {1 + [1 − k(x − c)/b]^(1/k)}

f(x) = (1/b) [1 − k(x − c)/b]^(1/k − 1) / {1 + [1 − k(x − c)/b]^(1/k)}²

x(F) = c + (b/k) {1 − [(1 − F)/F]^k}

where b, c, and k are scale, location, and shape parameters, respectively. The support (or sample space) of the GLO distribution is the same as that of the GEV distribution, given in Eq. (7).
For the P3 distribution, we have:

f(x) = [1/(α Γ(λ))] [(x − m)/α]^(λ − 1) exp[−(x − m)/α],  x > m

where Γ(·) denotes the gamma function, and where F(x) and x(F) have no closed form and are calculated numerically; α, m, and λ are scale, location, and shape parameters, respectively. As we mention in Appendix A, the lmom package of the computing language R was the one used to estimate the P3 distribution parameters by probability weighted moments. Using this package with the LP3 distribution produced some convergence problems, which led to the exclusion of the LP3 distribution from this study.
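Under the parameterizations above, the quantile functions can be sketched as follows (a hedged illustration only; the study itself used the lmom package of R). The P3 quantile is obtained numerically through the gamma distribution, here assuming a positive scale α (positive skew):

```python
import math
from scipy import stats

def gev_quantile(F, b, c, k):
    """GEV quantile function: x(F) = c + (b/k) * (1 - (-ln F)^k), for k != 0."""
    return c + (b / k) * (1.0 - (-math.log(F)) ** k)

def glo_quantile(F, b, c, k):
    """GLO quantile function: x(F) = c + (b/k) * (1 - ((1 - F)/F)^k), for k != 0."""
    return c + (b / k) * (1.0 - ((1.0 - F) / F) ** k)

def p3_quantile(F, alpha, m, lam):
    """P3 quantile, computed numerically via the gamma distribution with
    shape lam, scale alpha, location m (alpha > 0 assumed)."""
    return m + alpha * stats.gamma.ppf(F, lam)

# Sanity check: the GLO median sits at the location parameter c, because
# ((1 - F)/F)^k = 1 at F = 0.5:
# glo_quantile(0.5, 2.0, 3.0, 0.3) -> 3.0
```

These closed forms make the GEV and GLO convenient for simulation by inversion, while the P3 requires a numerical gamma quantile at each call.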

Comparing the discrimination statistics by Monte Carlo simulation
Our objective is to compare three discrimination statistics: PPCC, TN.PPCC and TN.SW, with respect to their ability to correctly select between three model pairs (M1, M2). These model pairs are (GEV, GLO), (P3, GLO) and (GEV, P3). We will use the probability of correct selection (PCS) to compare the performance of the three discrimination statistics. The Monte Carlo algorithm that serves to calculate PCS_M1, the PCS when M1 is the assumed true model, is described in steps 1 to 5 below (when M2 is the true model, a similar algorithm yields PCS_M2). In selecting the sample sizes to include in the Monte Carlo experiment, it is useful to focus on those commonly encountered in hydrological practice. In this study, we chose sample sizes n = 20(20)100, i.e., n = 20 to 100 in steps of 20, which we consider to be representative of those frequently encountered in hydrology.
The PCS results obtained from the Monte Carlo experiment depend on the parameters of the distributions being considered, especially the values of their shape parameters.
Since we are dealing with three-parameter distributions, we need to incorporate several three-dimensional input vectors θ for each distribution into the Monte Carlo experiment. To select input parameters θ that are representative of those found in hydrological practice, we chose to base the selection on real hydrologic data. To this end, we considered annual maximum streamflow series recorded at 220 hydrometric stations from across Canada, chosen from the Water Survey of Canada's HYDAT database. These stations have been screened for regulation, diversions, or land use influences or changes; they are considered to have good quality data. For each series, we calculated the mean, the coefficient of variation c_v, and the coefficient of skewness c_s. Figure 1 depicts a plot of c_s versus c_v for these 220 series. The three quartiles of the observed c_s values are Q_1 = 0.55, Median = 0.97 and Q_3 = 1.55. From the 220 series, we selected a subset of 20 series with c_s values more or less uniformly distributed among the 220 observed c_s values. The objective was to choose a set of stations that exhibit a wide range of skewness in their frequency distribution. Table 1 presents the 20 selected series, ordered according to their c_s values from smallest to largest. The series are identified as #1 to #20. It is noted that the c_s values range from 0.17 to 2.62, with the first half having c_s < 1.00 and the second half having c_s > 1.00.
The table displays, for each series, the mean, the standard deviation s, c_v and c_s. In choosing the 20 series, no consideration was given to any potential numerical difficulties to be encountered later in the analysis. However, with the parameter estimation method used, convergence problems were in fact later encountered with the last two stations in Table 1, which are the ones with c_s > 2.00. We therefore eliminated these two stations and limited consideration in our Monte Carlo simulations to the first 18 stations for which the estimation method converged. From each of the 18 (x̄, c_s, c_v) combinations, parameter vector estimates θ̂ were obtained for each distribution (GEV, P3, GLO) by the method of moments, using equations presented in Appendix A. For each distribution, we used these 18 parameter vector estimates as input into the Monte Carlo experiment. We will refer to these as θ_input Vectors #1, #2, …, #18.
We can confirm that the results presented are representative of the complete Monte Carlo results obtained for PPCC and TN.PPCC.
It is readily noted from Table 2 that the probabilities of correct selection, PCS_M1 and PCS_M2, vary considerably. If we take as an example the case [θ_input Vector #4, discrimination statistic = PPCC, (M1, M2) = (GEV, GLO), n = 100], the reported PCS values are 90(68). This means that the PPCC method correctly identifies the GEV distribution as the true model 90% of the time, while correctly identifying GLO as the true model only 68% of the time. The average PCS value (Eq. 3), PCS.mean = (90% + 68%) / 2 = 79%, is a measure of discrimination power.
From the information presented in Table 2, the two measures of discrimination power and discrimination absolute bias were calculated and are presented in Table 3. This table simplifies the comparison between PPCC and TN.PPCC. It shows that the difference in performance between these two discrimination statistics lies more in their discrimination absolute bias than in their power. This is clearly seen by referring to the last two rows marked "Average" in Table 3, as well as by referring to Figure 2 (A and B). In fact, Figure 2 (A) shows the difference in discrimination power between PPCC and TN.PPCC to be small, but Figure 2 (B) shows the difference in discrimination absolute bias to be much more substantial. From Figure 2 (B), and the last two rows marked "Average" in Table 3, the following conclusions may be drawn: 1) in the (GEV, GLO) discrimination, TN.PPCC decisively outperforms PPCC by producing lower discrimination absolute bias; 2) in the (P3, GLO) discrimination, TN.PPCC generally outperforms PPCC, except for the small sample size of n = 20; 3) in the (GEV, P3) discrimination, both TN.PPCC and PPCC fail to produce PCS.mean values much larger than 50%, which points to a difficulty in discriminating between the GEV and P3 distributions based on the population parameters and sample sizes included in the Monte Carlo experiment.

From the complete results obtained from the Monte Carlo experiment, we found no clear reason to continue investigating the PPCC discrimination statistic, because it was quite steadily outperformed by TN.PPCC with regard to discrimination absolute bias. Therefore, in the remainder of this section, we will shift attention to comparing the TN.PPCC and TN.SW discrimination statistics.

Comparing TN.SW and TN.PPCC
Table 4 and Figure 3 present the Monte Carlo simulation results for (GEV, GLO) discrimination.
Figure 3 shows the difference between TN.SW and TN.PPCC to lie more in their discrimination absolute bias (Figure 3 (B)) than in their discrimination power (Figure 3 (A)). The same conclusion is arrived at by referring to the last two rows in Table 4 marked "Average".

Table 5 and Figure 4 present the results for (P3, GLO) discrimination. Figure 4 shows the difference between TN.SW and TN.PPCC to lie more in their discrimination absolute bias (Figure 4 (B)) than in their discrimination power (Figure 4 (A)). The same conclusion is arrived at by referring to the last two rows marked "Average" in Table 5. These two rows show TN.SW to outperform TN.PPCC by producing lower discrimination absolute bias for all sample sizes considered. Therefore, it seems reasonable to consistently recommend TN.SW for (P3, GLO) discrimination, while keeping in mind that for small sample sizes (e.g., n = 20) and when there is an indication that the population skewness is quite large (e.g., c_s > 1.0), TN.PPCC may be preferable to TN.SW. From such a recommendation, the expected PCS.mean value should be ≈ 61% for n = 20 and ≈ 75% for n = 100, as can be seen from the last two rows marked "Average" in Table 5. The expected PCS.abs.diff should be less than about 10% (PCS mean error less than ±5%) for n = 20 and less than about 6% (PCS mean error less than ±3%) for n ≥ 40.
Table 6 and Figure 5 present the results for (GEV, P3) discrimination. Once again, Figure 5 shows TN.SW and TN.PPCC to differ more in their discrimination absolute bias (Figure 5 (B)) than in their discrimination power (Figure 5 (A)). The same conclusion is arrived at by referring to the two rows marked "Average" in Table 6. These two rows and Figure 5 (A) show that both TN.SW and TN.PPCC fail to produce PCS.mean values much larger than 50%. In fact, one should not expect to obtain a PCS.mean value greater than 60% except when there is an indication that the population skewness is quite large; e.g., c_s > 1.2. These results point to a difficulty in discriminating between the GEV and P3 distributions, at least from what our Monte Carlo experiments could show.

As an application, we shall revisit the 18 stations considered in the previous sections. Table 7 contains the statistics computed for these 18 stations. The column marked "Chosen Model" shows the model selected for each station by applying the TN.SW and TN.PPCC statistics. In this column, it is seen that GEV appears six times, P3 seven times and GLO three times. These are the cases where both the TN.SW and TN.PPCC statistics lead to the same model choice. However, for two of the stations, marked by the symbol *, the two statistics do not lead to selecting the same model. These are stations #5 and #12. For these two stations, the TN.SW method recommends choosing GEV whereas TN.PPCC chooses GLO. So, the choice of model for these two stations requires revisiting the conclusions drawn from our simulation experiments.
For this purpose, we will focus on Figure 3 (B) to find which model to recommend for these two stations. Since the sample size of station #5 is equal to 42 and that of station #12 is equal to 41, we will refer in Figure 3 (B) to the case n = 40. In this case, it is seen that TN.PPCC exhibits smaller discrimination bias as compared to TN.SW. Therefore, for these two stations, we will choose GLO as a model, in accordance with the TN.PPCC method.
As a further comparison of the distributions being investigated, we estimated the 100-year event with each of the three distributions for the 18 stations, using probability weighted moments. The GLO model consistently gave the highest 100-year-event estimates in comparison to the other two models.
This may indicate a tendency of the GLO distribution to overestimate the 100-year event. The absolute difference between the highest and the lowest 100-year-event estimates was also calculated and is presented in the last column of Table 7. We express this absolute difference as a percentage of the highest 100-year-event estimate; i.e., 100 × (highest − lowest) / highest. It is seen that the calculated absolute difference for the 18 stations varies between 4.8% and 12.5%: a difference that could be considered quite substantial.
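The T-year event and the percentage measure in the last column of Table 7 can be sketched as follows (the numeric estimates below are hypothetical, not taken from Table 7):

```python
def t_year_event(quantile_fn, T):
    """T-year event: the quantile at non-exceedance probability 1 - 1/T."""
    return quantile_fn(1.0 - 1.0 / T)

def pct_abs_diff(estimates):
    """Absolute difference between the highest and lowest estimates,
    expressed as a percentage of the highest (last column of Table 7)."""
    hi, lo = max(estimates), min(estimates)
    return 100.0 * (hi - lo) / hi

# Hypothetical 100-year-event estimates from three fitted models (m^3/s):
q100 = [950.0, 905.0, 880.0]
d = pct_abs_diff(q100)  # about 7.4% here
```

In the paper's comparison the three estimates would come from the GEV, P3 and GLO quantile functions fitted to the same station by probability weighted moments.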

Discussion and conclusion
We dealt with discriminations between pairs of statistical distributions belonging to the group {GEV, P3, GLO}. We focused on sample sizes commonly encountered in hydro-meteorological applications and aimed at finding discrimination methods that are relatively powerful and easy to use. In assessing the methods' performance, we paid attention to both discrimination power and discrimination bias. While seeking high discrimination power is usually a key goal, seeking low discrimination bias is also essential, because we need to avoid favoring one competing model over the other in the discrimination. In our view, the assessment of discrimination bias has not been given the attention that it deserves in previous research.
To assess the fit of distributions to data, the Akaike information criterion (AIC), the Bayesian information criterion (BIC), and/or a goodness-of-fit measure such as the Anderson-Darling (AD) statistic are widely used in hydrological applications. However, we uncovered some inconveniences of these model choice methods, mainly relating to difficulties in assessing their performance in terms of discrimination power and discrimination bias. In applying the AD method, one should not simply choose the model that gives the lowest AD statistic, because this generally leads to a large discrimination bias between the competing models, as mentioned in section 3. For better discrimination, the "p-value-based" approach is more appropriate, but this approach requires a degree of computer programming that limits its appeal in practice. For choosing between models with a different number of unknown parameters, the AIC and BIC criteria have the appeal of offering a trade-off between model goodness of fit and complexity.
However, because of their strict reliance on maximum likelihood estimation, the AIC and BIC had to be eliminated from consideration in this study.

Given a sample x_1, …, x_n arising from an unknown distribution, goodness-of-fit tests serve to check the compatibility of the sample with a hypothesized cumulative distribution function F. Since some or all parameters of F are usually unknown, the user first needs to estimate them from the data. Hence, parameter estimation plays an important role in applying goodness-of-fit tests. Commonly used parameter estimation methods include those of maximum likelihood (ML), of moments, and of probability weighted moments (PWM). Applying an estimation method to the sample yields an estimate F̂ of F. Continuous probability distributions are particularly important in hydro-meteorological frequency modeling. When F is continuous, some popular goodness-of-fit tests make use of the empirical cumulative distribution function in assessing the fit of F to the data. The empirical cumulative distribution function is given by:

F_n(x) = (1/n) Σ_{i=1}^{n} 1{x_i ≤ x}    (1)

where 1{·} is the indicator function. Model pairs whose discrimination has been studied in past research include: … vs. Inverse Gaussian (Strupczewski et al. 2006); lognormal vs. log-logistic (Dey & Kundu 2010; Aucoin & Ashkar 2010; Ashkar & Aucoin 2012b); and Weibull vs. log-logistic (Ashkar & Aucoin 2012a).
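The empirical cumulative distribution function of Eq. (1) can be sketched directly (function and variable names are ours):

```python
import numpy as np

def ecdf(sample, x):
    """Empirical CDF of Eq. (1): the fraction of observations <= x."""
    s = np.asarray(sample, dtype=float)
    return np.count_nonzero(s <= x) / s.size

data = [3.0, 1.0, 4.0, 1.5, 5.0]
# ecdf(data, 3.0) -> 0.6 (three of the five observations are <= 3.0)
```

Statistics such as Kolmogorov-Smirnov, Cramér-von Mises and Anderson-Darling all quantify, in different norms, the discrepancy between this step function and the fitted F̂.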

1. Fix the sample size n and generate a sample x_n of n iid observations from model M1;
2. Fit both models M1 (the correct model) and M2 (the wrong model) to x_n by probability weighted moments;
3. Apply each of the three discrimination statistics and record the test result as "1" if model M1 is selected (correct decision) or "0" if model M2 is selected (wrong decision);
4. Repeat the preceding steps N = 1000 times, and store the test results (I_j; j = 1, …, N) in a vector I, which contains only zeros and ones;
5. Calculate Σ_{j=1}^{N} I_j / N and use it as an estimate of PCS_M1.
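The five-step algorithm above can be sketched as follows. To keep the example self-contained, we substitute a simpler two-parameter pair (Gumbel vs. normal, fitted by ordinary moments rather than probability weighted moments), a plain PPCC decision rule, and N = 500 replications; everything except the five-step structure is therefore an assumption of this sketch:

```python
import numpy as np
from scipy import stats

EULER = 0.5772156649015329  # Euler-Mascheroni constant

def fit_normal(x):
    """Method-of-moments normal fit."""
    return stats.norm(np.mean(x), np.std(x, ddof=1))

def fit_gumbel(x):
    """Method-of-moments Gumbel fit: var = (pi^2/6) scale^2,
    mean = loc + EULER * scale."""
    scale = np.std(x, ddof=1) * np.sqrt(6.0) / np.pi
    return stats.gumbel_r(np.mean(x) - EULER * scale, scale)

def ppcc(x, dist):
    """PPCC with Hazen plotting positions p_i = (i - 0.5)/n."""
    x = np.sort(x)
    p = (np.arange(1, len(x) + 1) - 0.5) / len(x)
    return np.corrcoef(x, dist.ppf(p))[0, 1]

def pcs_m1(gen_m1, fit_m1, fit_m2, n=50, N=500, seed=7):
    """Estimate PCS_M1: fraction of M1-generated samples for which the
    discrimination statistic (here, plain PPCC) selects M1 over M2."""
    rng = np.random.default_rng(seed)
    wins = 0
    for _ in range(N):
        x = gen_m1(rng, n)                    # step 1: generate from M1
        d1, d2 = fit_m1(x), fit_m2(x)         # step 2: fit both models
        wins += ppcc(x, d1) > ppcc(x, d2)     # step 3: record the decision
    return wins / N                           # steps 4-5: average the 0/1 results

pcs = pcs_m1(lambda rng, n: stats.gumbel_r(0.0, 1.0).rvs(n, random_state=rng),
             fit_gumbel, fit_normal)
```

With skewed Gumbel data, the PPCC under the correct Gumbel fit usually exceeds the PPCC under the (symmetric) normal fit, so the estimated PCS lies well above 50%; swapping the roles of the two models gives PCS_M2, from which Eqs. (3) and (4) follow.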
The last two rows in Table 4, marked "Average", show TN.PPCC to outperform TN.SW by producing lower discrimination absolute bias when n ≤ 80, but TN.SW outperforms TN.PPCC for n = 100. The average PCS.mean value to expect from applying either of the two discrimination statistics is ≈ 59% for n = 20 and ≈ 77% for n = 100. Figure 3 (A) shows PCS.mean to decrease as the population skewness increases for both TN.SW and TN.PPCC. Figure 3 (B) shows TN.PPCC to outperform TN.SW by producing lower discrimination absolute bias when n = 20, 40, but for n ≥ 60 both discrimination statistics show comparable discrimination absolute bias performance. As a practical recommendation, it seems reasonable to propose the use of TN.PPCC when n ≤ 50 and either of the two methods when n > 50. Under such a recommendation, the expected PCS.abs.diff value (a measure of discrimination absolute bias) should be less than 10% (i.e., a PCS mean error less than ±5%), as can be seen from the last two rows marked "Average" in Table 4.

Figure 1. Plot of c_s versus c_v for the 220 annual maximum streamflow series.
The calculation of the TN.PPCC statistics R*_M1 and R*_M2 is done as follows: a) start by assuming M1 to be the true model for the sample x_n and use Z_i* = Φ⁻¹(F̂_M1(X_i)) to transform x_n to an approximately N(0, 1) sample z_n*; b) calculate w_n* = {W_i*, i = 1, …, n}, where W_i* = Φ⁻¹(F̂_N(W_i)) = Φ⁻¹(p_i); c) calculate the TN.PPCC statistic R*_M1 by applying the formula of Eq. (5) to z_n* and w_n*; d) assume M2 to be the true model and repeat steps a) to c) to calculate R*_M2.
Appendix B presents the steps involved in calculating the TN.SW statistic S_M1 (resp. S_M2) when M1 (resp. M2) is the fitted model. If s_M1 (resp. s_M2) is a realization of S_M1 (resp. S_M2) based on an observed sample x_n, the decision rule is to choose M1 as the true model if s_M1 > s_M2 and to choose M2 otherwise.
Essentially, R measures the linearity of the probability plot of W_i versus X_(i), and values of R close to 1.0 indicate a good fit of the distribution to the sample. One may choose to use R to discriminate between two models (M1, M2). Let R_M1 (resp. R_M2) be the PPCC statistic when M1 (resp. M2) is the fitted model. If r_M1 (resp. r_M2) is a realization of R_M1 (resp. R_M2) based on an observed sample x_n, the decision rule would be to select M1 as the true model if r_M1 > r_M2 and to select M2 otherwise. However, our experience has shown that a modification of the R statistic tends to improve its performance. This modification, which we denoted in subsection 2.2 by TN.PPCC, employs a sample transformation to normality, followed by an application of the PPCC statistic. The calculation of the TN.PPCC statistics R*_M1 and R*_M2 is outlined in Appendix B. If r*_M1 (resp. r*_M2) is a realization of R*_M1 (resp. R*_M2) based on an observed sample x_n, the decision rule is to choose M1 as the true model if r*_M1 > r*_M2 and to choose M2 otherwise.
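The transformation-to-normality idea behind TN.SW and TN.PPCC can be sketched as follows, assuming the fitted CDF F̂_M is available as a callable. For illustration only, we apply the true Gumbel CDF to a Gumbel sample, so the transformed values are exactly standard normal; the clip guards against u = 0 or 1:

```python
import numpy as np
from scipy import stats

def tn_statistics(x, fitted_cdf):
    """Transform the sample to approximate normality, z_i = Phi^{-1}(F_hat(x_i)),
    then compute the TN.SW (Shapiro-Wilk) and TN.PPCC statistics on z."""
    u = np.clip(fitted_cdf(np.asarray(x, dtype=float)), 1e-10, 1.0 - 1e-10)
    z = stats.norm.ppf(u)                 # approximately N(0, 1) if the model fits
    sw = stats.shapiro(z)[0]              # TN.SW: Shapiro-Wilk W on z
    z = np.sort(z)
    p = (np.arange(1, len(z) + 1) - 0.5) / len(z)   # Hazen positions
    r = np.corrcoef(z, stats.norm.ppf(p))[0, 1]     # TN.PPCC
    return sw, r

# Sketch: a Gumbel sample transformed through its own (assumed known) CDF.
rng = np.random.default_rng(3)
x = stats.gumbel_r(0.0, 1.0).rvs(80, random_state=rng)
sw, r = tn_statistics(x, stats.gumbel_r(0.0, 1.0).cdf)
# Both statistics should be close to 1 when the hypothesized model is correct.
```

In the discrimination setting, one computes these statistics once with F̂_M1 and once with F̂_M2, and selects the model yielding the larger value.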

Table 1 .
Information on the 20 chosen annual maximum streamflow series.

As explained in subsection 2.3, the AIC and BIC, because of their strict reliance on maximum likelihood estimation, had to be eliminated in this study due to serious difficulties in assessing their performance by Monte Carlo simulation. For these reasons, we proposed TN.SW and TN.PPCC as discrimination tools, and were able to assess their performance by Monte Carlo simulation. Both of these tools are based on the idea of transforming the sample to approximate normality prior to performing model discrimination. TN.SW is a recently developed tool that needed further validation, while TN.PPCC is a new one that has been proposed and tested for the first time in the present study. An important practical advantage of both these methods is that they are easy to use. For (GEV, GLO), (P3, GLO) and (P3, GEV) discriminations, TN.SW and TN.PPCC differed mainly in their discrimination bias, but not in their discrimination power. For (GEV, GLO) discrimination, our recommendation was to use TN.PPCC when n ≤ 50, and either TN.SW or TN.PPCC when n > 50. With such a recommendation, the expected PCS mean error should be no more than ≈ ±5%, with a PCS.mean value ≈ 59% for n = 20 and ≈ 77% for n = 100. For (P3, GLO) discrimination, our general recommendation was to use TN.SW. Associated with such a recommendation is an expected PCS mean error of no more than ≈ ±3% and a PCS.mean value ≈ 61% for n = 20 and ≈ 75% for n = 100. In the (P3, GEV) discrimination, both TN.SW and TN.PPCC failed to produce PCS.mean values much larger than 50%. This pointed to a difficulty in discriminating between the P3 and GEV distributions for the range of population parameters and sample sizes covered by our Monte Carlo simulations, which we chose based on observed annual maximum streamflow series from 220 Canadian hydrometric stations. The discrimination between P3 and GEV may prove to be less difficult if the population parameters are changed.

Table 2 .
Probability of correct selection (PCS), in percentage, rounded to the nearest integer, for comparing PPCC and TN.PPCC. Outside the brackets are PCS values under GEV as the true model; within brackets are PCS values under GLO as the true model.

Table 3 .
PCS means and absolute differences, both in percentage rounded to the nearest integer, for comparing PPCC and TN.PPCC. Values of PCS.mean are outside the brackets; values of PCS.abs.diff are within brackets.

Table 4 .
Comparing TN.SW and TN.PPCC for (GEV, GLO) discrimination. This table gives PCS means and absolute differences, both in percentage rounded to the nearest integer. PCS.mean values are outside the brackets; PCS.abs.diff values are within brackets. In the last two rows of the table, the "best" PCS results are highlighted.

Table 5 .
Comparing TN.SW and TN.PPCC for (P3, GLO) discrimination. This table gives PCS means and absolute differences, both in percentage rounded to the nearest integer. PCS.mean values are outside the brackets; PCS.abs.diff values are within brackets. In the last two rows of the table, the "best" PCS results are highlighted.

Table 6 .
Comparing TN.SW and TN.PPCC for (GEV, P3) discrimination. This table gives PCS means and absolute differences, both in percentage rounded to the nearest integer. PCS.mean values are outside the brackets; PCS.abs.diff values are within brackets. In the last two rows of the table, the "best" PCS results are highlighted.

Table 7 .
Use of the TN.SW statistic s and the TN.PPCC statistic r* to choose between the GEV, GLO and P3 models for the 18 data series discussed in the Application.

ID | Sample size | s.GEV | s.P3 | s.GLO | r*.GEV | r*.P3 | r*.GLO | Chosen Model