COBHUNI and IT technology
At the center of the COBHUNI project lies the analysis of the exegetical literature on the Qur’an and
Hadith, with the aim of identifying links and overlaps between texts.
Given the vastness of the material on the one hand and the complexity of the individual steps of the analysis on the other, the project will work with digitized Arabic texts. These will be stored in a format that allows tools from computational linguistics to identify citation patterns on a larger scale, and at a qualitatively different level, than would be possible through traditional philological analysis by a researcher alone. This process will generate data that the philologists will then assess and evaluate. In turn, this data will help us answer some of the central research questions and develop further ones.
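As an illustration of what identifying citation patterns computationally can look like, here is a minimal Python sketch that finds word sequences shared by two texts. It is not the COBHUNI pipeline: the function names, the n-gram length of three words, and the normalization step (dropping vowel marks and the tatweel elongation character) are assumptions chosen for the example.

```python
import unicodedata

TATWEEL = "\u0640"  # Arabic elongation character, carries no meaning


def normalize(text: str) -> str:
    """Drop combining marks (short vowels, shadda, sukun) and tatweel,
    then collapse whitespace, so that vocalized and unvocalized
    spellings of the same word compare as equal."""
    kept = [
        ch for ch in text
        if unicodedata.category(ch) != "Mn" and ch != TATWEEL
    ]
    return " ".join("".join(kept).split())


def word_ngrams(text: str, n: int = 3) -> set[tuple[str, ...]]:
    """Return the set of all runs of n consecutive words in the text."""
    tokens = normalize(text).split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def shared_ngrams(text_a: str, text_b: str, n: int = 3) -> set[tuple[str, ...]]:
    """Word n-grams that occur in both texts: candidate quotations."""
    return word_ngrams(text_a, n) & word_ngrams(text_b, n)


def jaccard(text_a: str, text_b: str, n: int = 3) -> float:
    """Share of n-grams common to both texts, between 0.0 and 1.0."""
    a, b = word_ngrams(text_a, n), word_ngrams(text_b, n)
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

Shared n-grams only flag candidate passages. Whether a match is a genuine citation, a formula, or a coincidence remains a question for the philologists evaluating the data.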
The data will be generated from three sources:
- Already digitized texts: those openly accessible on the internet and those digitized during preliminary work
- Digitization of published editions in accordance with EU copyright law and the digitization recommendations of the Deutsche Forschungsgemeinschaft (DFG)
- Typing of text from selected digitized manuscripts and from published texts with extremely low print quality
COBHUNI will try to generate as much data as possible by scanning text material and running OCR on it. We started working with the open-source software Tesseract but then decided to switch to kraken, another open-source OCR system, developed by the digital humanities department of Universität Leipzig. The kraken system now supports Arabic at a satisfactory level and provides good results. During the preliminary research, we tested the open-source tools against one of the leading commercial packages for Arabic OCR.
Successful OCR depends heavily on the specific technical properties of the scan. An OCR program that does well with the scans of one text may perform poorly with the scans of another text set in the same font and published at roughly the same time. It is therefore important to determine the exact scan specifications with which the OCR performs best for a given font, and this has to be done anew for each print. After a series of tests, and taking into account the specific requirements of COBHUNI's workflow, we opted for a high-quality book scanner, the Zeutschel OS 15000 Advanced Plus.
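One common way to compare scan settings for a given print is to transcribe a sample page by hand and measure the character error rate (CER) of the OCR output produced under each setting. The Python sketch below shows that calculation under stated assumptions: the setting labels are hypothetical, and the source does not say which metric the project actually used.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, ocr_output: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(reference, ocr_output) / len(reference)


def best_setting(reference: str, outputs: dict[str, str]) -> str:
    """Return the label of the scan setting whose OCR output has the
    lowest CER against the hand-made reference transcription."""
    return min(outputs, key=lambda label: cer(reference, outputs[label]))
```

For example, `best_setting(truth, {"300dpi_gray": text_1, "400dpi_bw": text_2})` returns the label with the fewer errors, where the two labels are made-up names for scan configurations. Because results do not transfer from one print to another, such a comparison would be repeated for each print.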