Abstract:
Interest in the automated survey of remotely sensed data for archaeological features has grown markedly in recent years (Davis 2020). Enthusiasm for AI-based feature detection in satellite and aerial imagery is understandable, as it holds the potential to dramatically expand scales of analysis to interregional and even continental views of archaeological distributions. Yet rigorous accuracy assessment is key to establishing reliable AI-based imagery survey. To that end, the results of automated survey techniques must be compared to independent datasets collected by more traditional means, such as systematic visual survey of satellite imagery. Withholding a randomly selected partition of the data for model evaluation is standard practice in machine learning workflows (Tan et al. 2021); however, this practice requires closer scrutiny for remotely sensed geographic data, where spatial dependency must also be considered in model evaluation. Spatial autocorrelation (systematic variation as a function of distance) means that adjacent image tiles cannot be treated as independent samples (Miller 2004). Instead, adjacent tiles should be expected to be correlated simply because of their proximity. If adjacent or spatially proximate tiles are included in both the training set and the datasets reserved for model evaluation, the result is an overconfident estimate of model performance. This paper forms the first step towards a credible, reliable, and thoroughly tested automated archaeological survey in the south-central Andean Highlands. To evaluate the effects of spatial autocorrelation on model evaluation, a convolutional neural network is trained and evaluated on data from the western cordillera of the southern Peruvian Andes, with the data partitioned in two different ways. First, the data is split into training and validation sets using a naïve random sampling strategy, so that some evaluation tiles are spatially adjacent to the image tiles used for training. The data is then split a second time, with steps taken to ensure that spatially proximate tiles are not divided between the training and test sets. Model performance under the two splits is then compared to examine the effects of spatial autocorrelation. This comparison of training data and model validation methods demonstrates that failing to account for spatial autocorrelation when partitioning training and evaluation data can lead to a substantial overestimation of model performance. Beyond this cautionary tale, the paper contributes a workflow for producing reliable training data and estimates of model performance for AI-assisted archaeological imagery survey.
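
To make the distinction concrete, the following minimal Python sketch contrasts a naïve random train/test split of image tiles with a spatially blocked split in which contiguous blocks of tiles are assigned wholly to one side of the partition. The tile grid, block size, and use of scikit-learn utilities here are illustrative assumptions, not the paper's actual pipeline.

    # Minimal sketch: naive random split vs. spatially blocked split of image tiles.
    # The 100 x 100 tile grid and 10 x 10 block size are hypothetical.
    import numpy as np
    from sklearn.model_selection import train_test_split, GroupShuffleSplit

    # Hypothetical grid of image tiles, indexed by (row, col).
    rows, cols = np.meshgrid(np.arange(100), np.arange(100), indexing="ij")
    tiles = np.stack([rows.ravel(), cols.ravel()], axis=1)

    # 1. Naive random split: adjacent tiles frequently land on both sides of the
    #    partition, so spatial autocorrelation leaks into the evaluation set.
    naive_train, naive_test = train_test_split(tiles, test_size=0.2, random_state=0)

    # 2. Spatially blocked split: assign each tile to a coarse block and split by
    #    block, so spatially proximate tiles stay together.
    block_size = 10
    block_id = (tiles[:, 0] // block_size) * (100 // block_size) + tiles[:, 1] // block_size
    splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
    train_idx, test_idx = next(splitter.split(tiles, groups=block_id))
    blocked_train, blocked_test = tiles[train_idx], tiles[test_idx]

Other blocking strategies, such as buffered splits or leave-one-region-out partitions, follow the same principle of keeping spatially proximate tiles on a single side of the train/evaluation divide.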