Abstract:
Bioarchaeological research is producing ever-increasing amounts of data from finite resources. To ensure that the wealth of information contained within these studies is available for reuse, it can be beneficial to apply the FAIR (Findable, Accessible, Interoperable and Reusable) data principles. Investigations into the FAIRness of bioarchaeological datasets have revealed that more must be done to increase their reusability, particularly for osteoarchaeological and palaeopathological datasets. Furthermore, osteological data is currently shared in published reports in PDF format, a use outside the format's original design that provides limited opportunities for data extraction and analysis. This research paper explores the use of Natural Language Processing (NLP) and Named Entity Recognition (NER) to overcome the shortcomings of PDFs in their current use and provide greater opportunities for data reuse in line with the FAIR data principles. These approaches were tested through the creation of a prototype system to search for osteoarchaeological terms within the Archaeology Data Service archive. The prototype was then evaluated by professional bioarchaeologists, students and members of the public for accuracy, reliability, time-saving ability, usefulness, accessibility and whether users would consider using it again. Despite some limitations, the results show real potential in the use of NLP and NER to make osteoarchaeological and palaeopathological information easier to access, thus unlocking the data trapped within ‘grey literature’.