Introduction

Hydrologists have promoted the use of basin classification systems. McDonnell and Woods (2004) suggested that such a system is useful not only for identifying important controls on water fluxes and pathways, but also as a common language that facilitates discussion of inter-catchment similarities and differences at regional and broader scales. Wagener et al. (2007) proposed that catchment classification would assist in choosing models for poorly understood hydrologic systems, guide the choice of simulation methods for predictions in ungauged streams, and provide insights into the potential impacts of climate and land use changes on catchment-scale hydrological responses.

Catchments have been classified based on the streamflow regime in many studies. The streamflow regime is a result of the interactions between climate and catchment characteristics, and the prediction of the streamflow regime at ungauged locations within a stream network is a fundamental problem in hydrology (Wagener & Wheater, 2006).

A number of approaches have been developed for predicting the streamflow regime in ungauged basins, including both process-based approaches and empirical models. Process-based hydrological models, such as the Distributed Monthly Water Balance Model (DMWBM) developed by Moore et al. (2012), have succeeded in well-gauged regions. The DMWBM takes gridded climate data and topographic data (e.g., a digital elevation model) as input and simulates the streamflow regime by estimating monthly streamflow with a water balance approach. It achieved remarkable results in British Columbia, Canada, but such a method requires high-resolution spatial data and a considerable amount of effort. Empirical approaches, on the other hand, model the relations between the streamflow characteristics of gauged streams and the climate, topography and land cover characteristics of their basins. Empirical models generally require less effort, but their performance depends strongly on the density of gauging stations, and they may be biased when the stations are not evenly distributed. This is a practical limitation because, even in developed countries such as Canada, many regions are sparsely gauged (Moore et al., 2012).
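
To make the water balance idea concrete, the following is a minimal sketch of a generic single-bucket monthly water balance. It is not the DMWBM of Moore et al. (2012), which is distributed and driven by gridded inputs; the function name, the parameter names, the storage capacity and the input values are all hypothetical and chosen only for illustration. The sketch assumes that evapotranspiration is met first from available water and that runoff occurs only when storage exceeds its capacity.

```python
def monthly_water_balance(precip_mm, pet_mm, capacity_mm=100.0, storage_mm=0.0):
    """Generic single-bucket monthly water balance (illustrative only).

    precip_mm   -- monthly precipitation totals (mm)
    pet_mm      -- monthly potential evapotranspiration totals (mm)
    capacity_mm -- assumed maximum soil water storage (mm), hypothetical value
    storage_mm  -- assumed initial storage (mm)

    Returns a list of monthly runoff depths (mm).
    """
    runoff_mm = []
    for precip, pet in zip(precip_mm, pet_mm):
        # Water available this month: carried-over storage plus new precipitation.
        available = storage_mm + precip
        # Actual evapotranspiration cannot exceed the water that is available.
        aet = min(pet, available)
        storage_mm = available - aet
        # Anything above the storage capacity leaves the bucket as runoff.
        excess = max(0.0, storage_mm - capacity_mm)
        storage_mm -= excess
        runoff_mm.append(excess)
    return runoff_mm


# Three hypothetical months: a wet month followed by two drier ones.
runoff = monthly_water_balance([200.0, 50.0, 0.0], [20.0, 60.0, 80.0])
print(runoff)
```

Even this simplified form needs precipitation and evapotranspiration estimates for every month and every location of interest, which suggests why a distributed, gridded version of the approach demands high-resolution spatial data.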

The objective of this study is to assess the ability of the Extreme Gradient Boosting (XGB) algorithm, a recently developed machine learning algorithm, to classify streamflow regimes in British Columbia (BC), Canada. The aim is to develop a reliable method with relatively low data requirements (e.g., basic climate and geographical characteristics of the basin, plus a set of already classified streams for training the model). Because the XGB algorithm is built on decision trees rather than traditional regression, it may also help to mitigate the error caused by biased sampling locations. In addition, because XGB can also perform numerical prediction, a simple model for estimating mean annual streamflow was developed to test the potential of machine learning for predicting streamflow magnitude at ungauged locations.