Second Mini Project
In this more advanced and independent adventure in the Digital Humanities, I aimed to use digital humanities methods to visualize certain texts integral to my thesis, in the form of word clouds built around a set of key terms.
A project of this variety invariably begins with assembling a corpus of machine-readable text to be subjected to machine-learning processes. The initial stage, OCRing the files and acquiring text versions of them, was beset by issues. For some of the corpus, strong OCRs fortunately already existed on Gallica, the French national library's (BnF) service for storing digitized pieces of its collection. However, certain entries, including Deux Campagnes en Haut-Sénégal-Niger by Henri Frey, had problems both with accuracy and with converting from OCR to plain text. An attempt to force another OCR by different means failed on my laptop, because the process was too intensive for my specs. Instead, it proved more efficient to manually copy the parts of the original OCR of the document I wanted, covering the 1887-8 period, into a text file.
Developing a visualization involved researching and testing the Python wordcloud package on a small part of my corpus. A frequency table was also added to the code to show the data behind the visualizations.
Making a useful word cloud requires assembling a table of excluded words, including definite and indefinite articles, prepositions, conjunctions, and other grammatical words of little analytical value. Many prebuilt stopword lists already exist, fine-tuned for just this kind of frequency analysis. I was able to find one in a GitHub repository and, after formatting it with ChatGPT, add it to the Python code as a set of words to ignore.
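A minimal sketch of the frequency table behind such a cloud, using only the standard library; the short STOPWORDS_FR set here is a stand-in for the much longer GitHub-sourced exclusion list:

```python
from collections import Counter
import re

# A tiny stand-in for the full exclusion list: articles, prepositions,
# and conjunctions of little analytical value.
STOPWORDS_FR = {"le", "la", "les", "de", "des", "du", "un", "une",
                "et", "ou", "dans", "sur", "que", "qui", "en"}

def frequency_table(text, stopwords=STOPWORDS_FR):
    """Count word frequencies, ignoring stopwords: the data behind a word cloud."""
    # Keeping the apostrophe in the pattern means tokens like l'an survive
    # whole, which illustrates how elided articles can slip past a stopword list.
    words = re.findall(r"[a-zà-ÿ']+", text.lower())
    return Counter(w for w in words if w not in stopwords)

freqs = frequency_table("Le marabout et les forces de Mahmadou dans le Haut-Sénégal")
```

The resulting Counter can be passed straight to a word cloud generator or printed with most_common() as a frequency table.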
While my current table has produced marked improvements over previous frequency generations, in the future I would, as seen in an example from my current tables, further include variants of these excluded words which appeared in my text due to spelling errors and side effects of the normalization process, namely the treatment of elided articles like l' and d' as distinct words because of how apostrophes are handled.
The word cloud at first looked at key words, primarily names like Mahmadou, Lamine, and Dramé for the leader I wanted to examine, Diana his capital, Soybou his son and deputy ruler of his state, and Ahmadou his strongest indigenous rival ruler. This also includes terms which are shorthand for other rulers, such as al-Hadj, the influential cleric and father of Ahmadou, and le marabout, which was used extensively as shorthand for Mahmadou in the 1887-9 period. I eventually dropped Dramé for its comparative rarity in most of the texts; it is an important lineage name used by Soninké sources to refer to Mahmadou but was not commonly used by the French.
The original plan was to gather each paragraph containing one of the keywords I mentioned. However, the documents in their .txt form often had erratic spacing, which obscured paragraph boundaries. I mitigated this by instead gathering chunks of 200 words around each mention of a keyword, merging chunks when occurrences fell within 50 words of each other. A further iteration of the code looped over the terms, found the text chunks for each one, and made separate word clouds for each term, rather than a single massive word cloud.
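One plausible reading of this chunking step can be sketched as follows; the window and merge sizes mirror the 200- and 50-word figures above, though the exact logic of my original script may differ:

```python
def keyword_chunks(words, keyword, chunk=200, gap=50):
    """Gather ~`chunk`-word windows around each occurrence of `keyword`,
    merging windows whose occurrences fall within `gap` words of each other."""
    hits = [i for i, w in enumerate(words) if keyword.lower() in w.lower()]
    spans = []
    for i in hits:
        start = max(0, i - chunk // 2)
        end = min(len(words), i + chunk // 2)
        if spans and start - spans[-1][1] <= gap:
            spans[-1] = (spans[-1][0], end)  # close enough: merge into previous window
        else:
            spans.append((start, end))
    return [" ".join(words[s:e]) for s, e in spans]

# Hypothetical usage on a normalized .txt file split into words:
# chunks = keyword_chunks(open("frey.txt", encoding="utf-8").read().split(), "Mahmadou")
```

Each returned chunk can then be fed to the word cloud code separately, producing one cloud per term.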
In the end, I was left with several ultimately incomplete, though analytically interesting, word clouds for each of the key terms mentioned before. Word clouds and the basic word-frequency analyses behind them definitely have limitations to keep in mind when drawing conclusions, especially when using very few, very specific texts which likely relied on each other to some degree. However, general themes related to the words seem interesting to explore. One of these is a surprising comparative absence of Mahmadou's numerous letters to French officials; his regional rival Ahmadou, the erstwhile French ally, is more strongly linked to letters and correspondence in the corpus. Rethinking parts of my thesis was a helpful part of this exercise, and a more finished exclusion table, a better OCR corpus, and a different set of keywords would probably prove beneficial for looking at these texts from other angles.
Digitisation Mini Project 2
The example of the Prophet Muhammad was one which the West African religious leader and anti-colonial resistance figure al-Hajj Muhammad al-Amin sought to emulate in his relatively brief moment of leadership, on the late nineteenth-century cusp of West African imperialism and modernity. A precondition of Muhammad al-Amin's claim to leadership was the creation of an Islamic religious community of sufficient ideological credentials to justify brutal wars against rival Muslim states, in addition to the Christian French and non-Islamic African peoples. In this respect, his project followed the example of West African antecedents in the tradition of jihad like Usman dan Fodio and Umar Tall, figures Muhammad al-Amin took direct inspiration from, who both emphasized highly centralized reformist movements led by a singular political and religious authority drawing on the Prophet's example.
In addition, the lettered, Jakhanke-Soninke scholarly milieu in which Muhammad al-Amin was formed (particularly the Goundiouru scholarly center in the east of contemporary Senegal, but with contacts spreading far to the west and south, where the poem was written) had developed its own prophetology: its own conception of the theological and institutional role mainly of the Prophet Muhammad, but also of the other prophets of the Quran, as gleaned from sources such as the Quran itself and the highly popular biographical materials and poetry which circulated throughout the Sahara. One observable outlet of this prophetology appears in a genre of original epic poems that Arabic-literate scholars authored in the middle of the nineteenth century: the "biniiboo", or praise poem for the Prophet Muhammad.
For my portfolio, I hope to complete a modest digital humanities project which can perform a frequency analysis on selections of a long and important example of this genre, authored by the scholar, in Pakao in present-day Senegal. The project would aim to gauge whether Mecca or Medina/Taybah is mentioned more in the reading. Once this is done and the code can reliably recognize these strings, a secondary project can look at the kinds of episodes that are evoked most strongly in the poem around these places. Hopefully, this will create a methodology by which, later in the thesis, I can work with other examples of praise poetry to make claims about the Soninke scholarly community's understanding of the prophets and of the leadership of a Muslim community, in the period leading up to Muhammad al-Amin's taking up of an explicit military-political role as leader of a jihad movement.
Practically, in this project I would have to begin by digitizing the codex of a little over 300 folios, then apply a transcription model, similar to other projects we have completed in the digital humanities course. Then I would have to code a simple Python loop which can look through the transcription and pull the data I choose. In this case, I will survey the number of occurrences of Medina and Mecca, then list the page numbers of each appearance of these words in the text. These words can give me a sense of what was important to represent in the life of Muhammad, especially the political and social ramifications of creating an Islamic community, for a scholarly community in a region marked by proximity and occasionally fraught relations between Muslims and non-Muslims. This would give me some tangible data to work with, in addition to allowing me to create a method for future digital humanities projects.
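The counting loop could be as simple as the sketch below; the page strings here are invented examples, and a real run would pass one string per transcribed folio:

```python
import re

def survey_terms(pages, terms):
    """Count each term's occurrences across transcribed pages and
    record the page numbers where it appears."""
    results = {t: {"count": 0, "pages": []} for t in terms}
    for num, page in enumerate(pages, start=1):
        for t in terms:
            n = len(re.findall(t, page, flags=re.IGNORECASE))
            if n:
                results[t]["count"] += n
                results[t]["pages"].append(num)
    return results

# Invented sample pages; the search terms would also need variant
# transliterations (e.g. Makka, Madina, Taybah) added to be thorough.
pages = ["praise of Taybah and Medina", "the road to Mecca", "Medina again"]
survey = survey_terms(pages, ["Mecca", "Medina"])
```

The page-number lists are what would let me go back and examine which episodes surround each place name.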
Evaluations: The HTR Values
Several numerical values are used to evaluate HTR. These are WAR, CAR, CER, and WER, which are all ways of mathematically representing the difference between a ground truth (an accurate, manual transcription of a document) and the automatic transcription attempted by a given model. HTR values are broadly sufficient for providing the feedback needed to develop a model for general use on texts in a given language. They give those in the digital humanities a sense of how accurate, overall, a transcription model is for the different texts or corpora it is used on.
These values, though simple to calculate and easy to understand, can obscure factors which are useful when analyzing texts. It is common to have different expectations for different kinds of texts when judging the relative efficiency of a model by its CER or another HTR value. For some standardized texts, such as those written on a typewriter or printed from computer text in a common romanized language, nothing less than a CER of 1% can be considered a good outcome. Conversely, a benchmark of 20% can be good for documents with idiosyncratically handwritten or obscured linguistic content, whether the problem is a lesser-known language, slang, or orthographical errors. The reasons for the mistakes made by the model, however, are left unclear by these values. Some scribal traditions might be less known to models, due to language or orthographic practices, and remain resistant to them despite being uniform among themselves. In such a case, the values alone would not tell you much about why a model is struggling with a certain text or group of texts; you would be unable to know unless you painstakingly compared the ground truth and the model-transcribed text yourself. This would, in turn, prove complicated for larger jobs on texts hundreds of pages in length, which are common in the digital humanities. It is best, therefore, for CER and the other values to be used alongside other strategies to ascertain the source of these problems.
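For reference, CER is computed from the Levenshtein (edit) distance between the ground truth and the model output; this minimal sketch shows the calculation the values summarize:

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution (free if equal)
        prev = cur
    return prev[-1]

def cer(ground_truth, prediction):
    """Character Error Rate: edit distance divided by ground-truth length."""
    return levenshtein(ground_truth, prediction) / len(ground_truth)
```

A CER of 0.01 corresponds to the 1% benchmark mentioned above; WER is the same calculation applied to word tokens rather than characters. Note how thoroughly the result erases *which* characters were wrong and why.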
Human and Machine descriptions of NLP and NER
Digitisation Project Complete!
Our first mini digitization project had two purposes. One was the immediate objective of the work: the digitization of an intriguing Persian work of Isma'ili esotericism. The other goal of the exercise was to give me and the other students instructive experience with many of the specifics of the manuscript digitization process. Participants gained practical experience with the various programs and methods used during digitization, including scanning, the use of eScriptorium, the application and fine-tuning of OCR models, and Python functions to measure the efficacy of these models.
The book we digitized, entitled In ketaab-e mostatāb seest majmū'eh-ye moshtamel bar do resāleh-ye mokhtasar dar haqīqat-e mazhab-e Esmā'īliyah, but referred to by its English title and by the class as the Two Early Ismaili Treatises (henceforth TEIT), is an interesting text dating from 1933. The book, published in Mumbai and written in Persian with a short introduction in English, is a document of historical value. The text shows us the author, an educated Indian Isma'ili, and his perspectives on esotericism and theology, giving the AKU library in London, and perhaps someday a researcher, a window into Isma'ili written culture and religion in South Asia during the early 20th century.
Any digitization process starts with the conversion of physical material to digital material. This is accomplished with a scanner, which isolates the page and produces a PDF or JPG image file. As a rule, only exceptional, valuable, and original texts are digitized, because scanning an entire book is a repetitive and labor-intensive process which can largely only be done by a human. For us, the task was made easy by the number of students cooperating in the process. After scanning, however, quality control is necessary to make sure that the uploaded images are free from anything which might obstruct or complicate the view of a reader, whether human or automated. A pair of students was responsible for making sure the orientation, clarity, and legibility of the document were unimpaired, which included cropping out extraneous parts of the image such as a scanner operator's hand, rotating pages to the correct orientation, or rescanning entire problematic pages. In addition, the proper sequence of the book had to be respected, meaning mistakes such as duplicates, missing images, or faulty numbering of the image files had to be corrected. Completing this task was important to make sure the next step, the OCR process, could proceed.
Four transcription models, developed with kraken for transcribing Arabic and Persian scripts, were tested on a three-page sample from the book. Their performance, represented by CER (Character Error Rate) and WER (Word Error Rate) values, was measured with a Python script developed by instructors and students at the AKU-ISMC. The script also includes normalisation tables, which are used to further qualify the WER and CER results. Based on another Python script (built with the assistance of ChatGPT) involving basic and easily replicable functions, such as split and len, the highest-performing transcription model was concluded to be kraken-gen2-print-n7m5-union-ft_best. The results were then checked against calculations made in an Excel spreadsheet. After determining the most accurate model, with the lowest WER and CER values, we were able to begin the OCR process. This involved using eScriptorium, which applies these models by first automatically segmenting texts, dividing each page into its constituent parts, such as main body, footnotes, and page numbers. This importantly ensures that when the page is converted to text, everything appears where it should, instead of all together in a single mess. eScriptorium then applies the transcription model, producing the transcribed data which is the aspiration of the digitization process.
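A word-level sketch of the kind of WER calculation such a script performs, built, as mentioned, from split and len; the strings here are invented examples, not drawn from TEIT:

```python
def word_edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,               # word deleted by the model
                           cur[j - 1] + 1,            # word inserted by the model
                           prev[j - 1] + (r != h)))   # word substituted
        prev = cur
    return prev[-1]

def wer(ground_truth, prediction):
    """Word Error Rate: word-level edit distance over ground-truth word count."""
    ref = ground_truth.split()
    return word_edit_distance(ref, prediction.split()) / len(ref)

print(wer("the quick brown fox", "the quik brown fox"))  # prints 0.25
```

Running this over each model's output against the same ground truth, and taking the model with the lowest score, reproduces the comparison in miniature.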
The digitization project gave me and the other students valuable, hands-on experience in working with a manuscript. It also helped the participants become comfortable applying an array of specific digital humanities skills, such as applying and training OCR models involving both segmentation and transcription. Finally, becoming acquainted and confident with basic Python coding to accomplish tasks, and with working through a sizable digital humanities project as a team, is very useful. This basic knowledge provides us an essential experiential background, allowing us to be confident and to imagine ways to tackle larger and more difficult problems.
Machine Learning and the Digital Humanities
Machine learning, or the capacity of a computer to "learn" and improve its ability to solve problems and recognize trends through experience with large amounts of input data, is becoming increasingly prominent. The development of complex algorithms, combined with increasing hardware capacity that improves the speed at which computers can run them, has in turn spurred the development of impressive models which teach computers to perform tasks independently and solve ever more complex problems. This developing situation bodes well for the digital humanities, especially those working with historical manuscripts. Persistent problems with manuscripts include the daunting volume of words, obscure language, illegibility due to the fragmentation and degradation of copies, and even entirely lost texts surviving across other manuscripts only as quotes or unattributed excerpts. Added to this is the tedious, time-consuming, and expensive process of digitization. New machine learning models offer solutions to these problems. As discussed in the CS50 podcast by David J. Malan and Colton Ogden, both unsupervised and supervised forms of learning pioneered in consumer analysis work with massive data sets, categorizing data into clusters and identifying anomalies. Applied to manuscripts, machine learning models can work with entire corpora of texts, comparing them and identifying anomalies in the form of words, phrases, and passages. These anomalies can be things like word choice, for example whether certain words are more present in one text or group of texts, or the presence of copied and paraphrased passages across multiple texts. Handwriting recognition is another intriguing avenue opened up by machine learning. Models have evolved to read and identify handwriting, a revolutionary development because computers have struggled to do this in the past.

This allows computers to read digitized manuscripts and convert them to manipulable text. Eventually, computers may even be able to identify different individuals within copying regimes in individual manuscripts, or across single and multiple corpora, offering new lines of inquiry for specialists of intellectual history interested in individuals and variations within the text and paratext of manuscripts. Machine learning offers exciting prospects for the digital humanities.
My first blog
This is my first blog post!