Machine Learning and the Digital Humanities

Machine learning, the capacity of a computer to “learn” from large amounts of input data and so improve its ability to solve problems and recognize trends, is becoming increasingly prominent. The development of complex algorithms, combined with growing hardware capacity that increases the speed at which computers can run them, has spurred the development of impressive models that can teach computers to perform tasks independently and solve ever more complex problems.

This bodes well for the digital humanities, especially for scholars working with historical manuscripts. Persistent problems in manuscript studies include the daunting volume of text, obscure language, illegibility caused by the fragmentation and degradation of copies, and the survival of otherwise lost texts only as quotations or unattributed excerpts in other manuscripts. Added to these is the tedious, time-consuming, and expensive process of digitization.

New machine learning models offer solutions to these problems. As discussed in the CS50 podcast by David J. Malan and Colton Ogden, both unsupervised and supervised forms of learning, pioneered in consumer analysis, work with massive data sets, categorizing data into multiple clusters and identifying anomalies. Applied to manuscripts, machine learning models can work with entire corpora of texts, comparing them and identifying anomalies in words, phrases, and passages. Such anomalies might be patterns of word choice, for example certain words appearing more often in one text or group of texts, or the presence of copied and paraphrased passages across multiple texts.

Handwriting recognition is another intriguing avenue opened up by machine learning. Models have evolved to the point where they can read and identify handwriting, a significant development given how much computers have struggled with this task in the past. This allows computers to read digitized manuscripts and convert them into manipulable text. Eventually, computers may even be able to identify individual copyists within a single manuscript, or across one or more corpora, offering new lines of inquiry for specialists in intellectual history who are interested in individuals and in variations within the text and paratext of manuscripts.

Machine learning offers exciting prospects for the digital humanities.
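As a rough illustration of what comparing texts across a corpus for copied passages can look like in practice, the sketch below scores pairs of texts by how many three-word sequences they share. This is plain n-gram overlap (Jaccard similarity), a simple stand-in rather than a trained machine learning model, and everything in it is an assumption made for illustration: the function names `shingles`, `jaccard`, and `shared_passages`, the threshold, and the three invented sample texts are all hypothetical.

```python
import re
from itertools import combinations


def shingles(text, n=3):
    """Return the set of word n-grams ("shingles") found in a text."""
    words = re.findall(r"[a-z]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Share of n-grams common to both sets, between 0.0 and 1.0."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def shared_passages(corpus, n=3, threshold=0.1):
    """List pairs of texts whose n-gram overlap meets the threshold."""
    sets = {name: shingles(text, n) for name, text in corpus.items()}
    results = []
    for x, y in combinations(sorted(sets), 2):
        score = jaccard(sets[x], sets[y])
        if score >= threshold:
            results.append((x, y, score))
    return results


# Invented sample texts, standing in for transcribed manuscripts.
corpus = {
    "ms_a": "In the beginning the scribe copied the words of the old master faithfully",
    "ms_b": "Later the scribe copied the words of the old master with some changes",
    "ms_c": "A list of grain prices recorded at the market during the harvest",
}

for first, second, score in shared_passages(corpus):
    print(f"{first} and {second} share material (overlap {score:.2f})")
```

Here only `ms_a` and `ms_b` are flagged, since they share a run of identical wording. Exact n-gram matching of this kind will miss paraphrase and spelling variation, which is where the learned models described above would be expected to do better.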