ADHO DH2019 Workshop “Towards Multilingualism in Digital Humanities: Achievements, Failures and Good Practices in DH Projects with Non-Latin Scripts”
Date & Time: 8 July 2019, 9:00 - 18:00
Workshop organizers: Martin Lee (martin.lee@fu-berlin.de) and Dr. Cosima Wagner (cosima.wagner@fu-berlin.de), Freie Universität Berlin, University Library / Campus Library (Germany)
Exposé
The one-day workshop responds to the call for multilingualism and multiculturalism in the Digital Humanities (DH) and discusses achievements, failures and good practices in DH projects with non-Latin scripts (NLS). In the sessions we want to provide hands-on insight into dos and don’ts in NLS contexts and identify practices that may transfer to other languages and disciplines, building upon lessons learned in the workshop “NLS in multilingual (software) environments” held in 2018 at Freie Universität Berlin [1]. The main goal was and is to strengthen an international network of NLS practitioners and experts who develop, maintain and distribute specific NLS knowledge, regardless of whether they work in academia, libraries, museums, or elsewhere.
While Digital Humanities scholarship is a vibrant and growing field of research in many humanities and social sciences disciplines, it has also been criticized as culturally and technologically biased (e.g. Fiormonte 2012; Mahony 2018). As a result, there is a lack of DH infrastructure suited for processing non-Latin scripts. This is a problem not only for the representation of DH research from non-anglophone countries but also for area studies disciplines within the so-called “Western” academic world.
In their call for papers for the DH Asia 2018 conference at Stanford University, the organisers stated that “when we look at DH in Western Europe and the Americas, we find a vibrant intellectual environment in which even college and university undergraduates – let alone more advanced researchers – can download off-the-shelf analytical platforms and data corpora, and venture into new and cutting-edge research questions; while, in the context of Asian Studies, we find an environment in which many of the most basic elements of DH research remain underdeveloped or non-existent” (Mullaney 2017).
This is not only true for Asian Studies but also for academic disciplines like Egyptology, Arabic Studies, Jewish Studies and other disciplines conducting DH research in or with non-latin scripts.
While there have been recent activities to strengthen collaboration and networking within the NLS DH community [2], there is still a strong need for knowledge exchange on NLS-suited DH tools, best practices and networking events.
- For example, how can we raise awareness of NLS-specific aspects among dominant standardization bodies such as the Unicode Consortium (especially for scripts such as hieratic signs, which are not yet represented) or in (inter)national authority files (Getty Thesauri, Wikidata, the German National Authority File, and so on)?
- How can we establish (new?) standards for multilingual metadata in NLS and how rigid or how flexible do they have to be?
- How can the recognition rate of non-standard characters with OCR be improved?
- How can multilingual/multiscript data from different sources be integrated and processed (semantic mapping, annotation, translation, NER, tagging etc.) in collaborative research platforms?
Furthermore, in line with the discourse on the digital transformation of academic research and teaching, the need for stronger collaboration is growing. Especially in the case of externally funded research projects, specific knowledge on DH and NLS can seldom be retained within the organisation, and project tools and platforms often cannot be maintained beyond the duration of the project. Instead, these responsibilities should be part of the service portfolio of research infrastructure institutions such as libraries or data centers. However, these institutions are normally not equipped to support all languages and disciplines.
The workshop tackles these issues:
- through presentations in which 15 experts conducting DH projects with Arabic, Chinese, Japanese and Korean (CJK) script sources and Ancient Egyptian hieratic signs share their challenges, answers and recommendations,
- by providing time for group discussions and a wrap-up session to secure modes of documentation and future collaboration on NLS DH tool best practices, and their transfer to research infrastructure institutions.
The presentations will address the following subjects and methods of NLS DH:
- digital representation of manuscripts: OCR (Arabic, CJK)
- digital representation of NLS (non-Unicode signs: Ancient Egyptian hieratic signs, Old South Arabian, CJK)
- digital research infrastructures and virtual research environments for NLS DH projects (Arabic, CJK, Ancient Egyptian hieratic signs)
- multilingual metadata and metadata in NLS (Arabic, CJK, Ancient Egyptian hieratic signs)
- semantic web and linked (open) data in NLS (CJK)
- text encoding and mark-up languages and NLS (Arabic, CJK, Ancient Egyptian hieratic signs)
- data mining / text mining in NLS (Arabic, CJK)
- NER, machine translation, annotation for NLS (Arabic, CJK)
Workshop Format
The first part of the one-day workshop will give developers and researchers the chance to present their challenges, solutions and tools for NLS DH related problems and questions.
In the second part of the workshop, we will develop an organisational strategy for cooperation and collaboration among the NLS DH community through a working group session. To aid organisation, we will provide a Wiki that participants can use during and after the workshop to organise collaboration.
We envision the results of this workshop as a cornerstone for building an institutionalized network of NLS DH practitioners with a joint knowledge management system (Wiki, GitHub etc.) and communication channels. Furthermore, we have initiated an NLS DH handbook and want to develop this into a living handbook to be maintained by the aforementioned NLS DH network. Contributions to the workshop are planned to be published in a special issue on NLS and DH in an open access journal (scheduled for the second half of 2019).
Notes
[1] For a workshop report in English see https://blogs.fu-berlin.de/bibliotheken/2019/01/18/workshop-nls2018/; the German version was published at DHd Blog: https://dhd-blog.org/?p=10669 .
[2] e.g. the annual DH Asia conferences at Stanford University, see http://dhasia.org/ (USA); a summer school in June 2019 on right2left issues at the Digital Humanities Summer Institute, see http://www.dhsi.org/events.php (Canada); a workshop on “Nicht-lateinische Schriften in multilingualen Umgebungen: Forschungsdaten und Digital Humanities in den Regionalstudien” (Multilingual infrastructure and Non-Latin scripts: Digital Humanities and Research Data in Area Studies) in June 2018 at Freie Universität Berlin/Campus Library (Germany).
Works cited
Fiormonte, Domenico (2012). Towards a Cultural Critique of the Digital Humanities. Historical Social Research / Historische Sozialforschung 37:3 (141), pp. 59-76.
Mahony, Simon (2018). Cultural Diversity and the Digital Humanities. Fudan Journal of the Humanities and Social Sciences, pp. 1-18. Springer. DOI: https://doi.org/10.1007/s40647-018-0216-0 (accessed 24 April 2019).
Mullaney, Tom (2017). Call for proposals: Digital Humanities Asia: Harnessing Digital Technologies to Advance the Study of the Non-Western World, 26-29 April 2018, Stanford University. https://carnetcase.hypotheses.org/3165 (accessed 24 April 2019).
Workshop timetable and abstracts
Download workshop timetable (without abstracts) in pdf format: https://box.fu-berlin.de/s/eXWfdzMqeSQZ5KZ
09:00 - 09:15 Introduction to the workshop
Martin Lee, Cosima Wagner; Freie Universität Berlin, University Library / Campus Library (Germany)
09:15 - 10:45 NLS & DH: discussing workflows
09:15 - 09:40: “Ancient Egyptian Hieratic Script – Aspects of Digital Paleography for a NLS”
Presenters: Svenja A. Gülden, Simone Gerhards, Tobias Konrad; Akademie der Wissenschaften und der Literatur | Mainz (Germany): Project “Altägyptische Kursivschriften” (Ancient Egyptian cursive scripts); Johannes Gutenberg University Mainz (Germany); Technische Universität Darmstadt (Germany)
Abstract The project aims to provide an online database and tool for paleographic research of ancient Egyptian cursive scripts (Gülden et al., 2017; Gerhards et al., 2018). In addition to the monumental hieroglyphs, different cursive scripts were used over approximately 3000 years. The so-called Hieratic Script was used for various genres such as literary, religious and administrative texts. Hieratic could be written with ink on papyrus, leather, wooden panels or other materials and even be carved in stone. The approximately 600-700 individual characters of the script can vary immensely in their graphical appearance.
To date, there are no satisfactory OCR techniques or other pattern recognition tools that can be used for hieratic script. Therefore, we need other digital methods and formats to detect similar signs and analyze the script.
In the first workshop session, we would like to show the first steps of our workflow: digitizing the ancient Egyptian sources, recording the single signs in the database with links to metadata and connections to other characters, and finally preparing the data for digital analysis and a repository. In the second part we will present preliminary methods for analyzing raster as well as vector graphics. Furthermore, we would like to discuss the needs of projects that deal with similar paleographic questions of NLS and the possibilities of data structure, analysis, front-end visualization and the data repository.
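As a purely illustrative aside (not the project’s actual method), one simple way to flag visually similar sign occurrences in raster scans is to compare downscaled grayscale bitmaps; the file names below are placeholders.

```python
# Minimal sketch, assuming two cropped raster images of hieratic signs.
# Not the project's workflow: it only illustrates one naive similarity measure.
from PIL import Image

def sign_distance(path_a: str, path_b: str, size: int = 32) -> float:
    """Return a distance in [0, 1]; lower means the two sign images are more alike."""
    a = Image.open(path_a).convert("L").resize((size, size))
    b = Image.open(path_b).convert("L").resize((size, size))
    diff = sum(abs(pa - pb) for pa, pb in zip(a.getdata(), b.getdata()))
    return diff / (255.0 * size * size)

# Hypothetical files: two occurrences of the same sign in different manuscripts.
print(sign_distance("sign_occurrence_1.png", "sign_occurrence_2.png"))
```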
Gerhards, S., Gülden, S. A., Konrad, T., Leuk, M., Rapp, A., and Verhoeven-van Elsbergen, U. (2018). Aus erster Hand – 3000 Jahre Kursivschrift der Pharaonenzeit digital analysiert. In: Vogeler, G. (ed.), DHd 2018. Kritik der digitalen Vernunft. Konferenzabstracts. Universität zu Köln 26. Februar bis 2. März 2018, Köln, pp. 357–359.
Gülden, S. A., Krause, C., and Verhoeven, U. (2017). Prolegomena zu einer digitalen Paläographie des Hieratischen. In: Busch, H., Fischer, F., and Sahle, P. (eds.), Kodikologie und Paläographie im digitalen Zeitalter 4, Schriften des Instituts für Dokumentologie und Editorik, Norderstedt, pp. 253–273.
09:40 - 10:05: “Arabic Script in Digital Humanities Research Software Engineering”
Presenters: Oliver Pohl, Jonas Müller-Laackmann (Berlin-Brandenburg Academy of Sciences and Humanities, Germany)
Abstract Arabic script is composed of consonants, long vowels and diacritical marks, only very few of which have simple equivalents in Latin script. It is possible to write Arabic without any diacritics (rasm, also see rasmify) or with the diacritics signifying the different consonants, vowels and/or grammatical cases. While most modern operating systems and web browsers offer support for Standard Arabic (al-Fuṣḥā), they do not cover the numerous dialects and related ancient languages, which vary significantly from Standard Arabic. When storing text representations of these ancient languages (e.g. Ancient South Arabic, see ccdb-transliterator) in a relational database, it is necessary to ensure utf8mb4 support.
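As a minimal illustration (not the rasmify tool itself), removing the optional vowel diacritics amounts to dropping Unicode combining marks; a full rasm form would additionally merge dotted letter variants, which this sketch does not attempt.

```python
# Minimal sketch: strip Arabic vowel diacritics (tashkil) by removing
# Unicode combining marks. This is not the rasmify tool, only an illustration.
import unicodedata

def strip_tashkil(text: str) -> str:
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")

print(strip_tashkil("مُحَمَّد"))  # -> "محمد", the same name without vowel signs
# When persisting such strings in MySQL, the connection and columns should use
# the utf8mb4 character set so that all code points survive round-trips.
```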
However, even MSA (Modern Standard Arabic) requires some specific features for digital representation. Since the Arabic script is written from right to left (RTL), user interfaces need to be reworked to match the RTL behavior patterns. Thanks to the flexbox CSS layout model, switching website layouts from LTR to RTL can be achieved by merely one additional CSS statement when implemented correctly.
Furthermore, most consonants and long vowels are written in ligatures. However, ligatures and other features of the Arabic script pose challenges regarding their digital representation. In their early stages, computers could not easily display Arabic script with ligatures, leading to a misrepresentation of the language. Although there are new CSS flags to enforce Arabic ligatures on websites when using the right font (e.g. Coranica, Amiri or Arabic Typesetting), adding tags within a word deletes the ligatures and thus changes the representation of the Arabic letters.
Although Arabic script can be represented as transliterated text in Latin script, there are numerous different transcription systems (e.g. DMG, Anglo-American), some of which are highly inconsistent. However, most transcription systems only cover written Arabic and are not necessarily useful for representing phonetically correct oral or dialectal language.
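To illustrate the point (this is not any project’s actual implementation), a naive character-by-character mapping along DMG lines already shows the limits of transliteration: context-sensitive rules (the definite article, tāʾ marbūṭa, vowel length) and unwritten short vowels are lost.

```python
# Naive, partial sketch of DMG-style letter mappings; real transliteration
# is context-sensitive and cannot be done character by character.
DMG_LETTERS = {
    "ء": "ʾ", "ا": "ā", "ب": "b", "ت": "t", "ث": "ṯ", "ج": "ǧ",
    "ح": "ḥ", "خ": "ḫ", "د": "d", "ذ": "ḏ", "ر": "r", "ز": "z",
    "س": "s", "ش": "š", "ص": "ṣ", "ض": "ḍ", "ط": "ṭ", "ظ": "ẓ",
    "ع": "ʿ", "غ": "ġ", "ف": "f", "ق": "q", "ك": "k", "ل": "l",
    "م": "m", "ن": "n", "ه": "h", "و": "w", "ي": "y",
}

def transliterate(text: str) -> str:
    return "".join(DMG_LETTERS.get(ch, ch) for ch in text)

print(transliterate("كتاب"))  # -> "ktāb"; the short vowel of "kitāb" is not written
```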
In Western academia, software to support research on the Arabic language often has to be crafted manually. All the problems mentioned above, as well as the underrepresentation of the Arabic language in software solutions, pose challenges for the long-term preservation of research data and research software. In our contribution we want to present the challenges and (our) solutions when creating research software in Digital Humanities projects handling Arabic script.
10:05 - 10:30 “SHINE: A Novel API Standard & Data Model to Facilitate the Granular Representation and Cross-referencing of Multi-lingual Textual Resources”
Presenters: Pascal Belouin, Sean Wang (Max Planck Institute for the History of Science; Department III, RISE Project "Research Infrastructure for the Study of Eurasia”, Berlin, Germany)
Abstract The granular representation and cross-referencing of online digital textual resources across editions in various languages in a machine-readable format is a complex problem, even more so when considering resources available in non-Latin scripts due to the large number of potential multilingual editions of the same text. Various solutions to this issue have been proposed over the years: one of the most notable initiatives is the Canonical Text Services protocol [1] (hereinafter referred to as CTS). A more recent ongoing project, the Distributed Text Services protocol [2], aims to propose a standard which addresses some of the limitations and issues encountered in the development and real-world use of CTS.
The main difficulties of designing such a protocol can be summarized as follows: First, what could be considered a ‘discrete’ textual resource can be composed of numerous hierarchical sub-parts which can potentially be written in different languages, down to the level of a single word. Furthermore, each of these sub-parts might have been translated into various languages and may therefore be available in a multitude of different online editions. To facilitate the consumption of these textual resources by digital research tools, it is therefore crucial to have a robust yet flexible way to represent and reference these different editions.
Over the last two years, Department III of the Max Planck Institute for the History of Science has been working on a protocol called SHINE [3] which aims to tackle this issue by proposing a new approach to the representation of textual resources and their associated metadata. By combining a relatively rigid way of modelling textual resources with a highly flexible and hierarchical metadata scheme, we believe that we can propose a protocol that is both straightforward to implement and powerful enough to allow the representation and cross-referencing of any type of multilingual online textual resources.
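To make the representational problem concrete, the sketch below models a work as a tree of sections, each with its own language tag and set of editions; this is an illustration only, not the actual SHINE (or CTS/DTS) data model, and the URIs are placeholders.

```python
# Illustrative data model only, not the SHINE/CTS/DTS specification:
# a textual resource as a tree of sections, each carrying a language tag
# and any number of online editions.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Edition:
    uri: str        # placeholder URI of one online edition
    language: str   # e.g. "ar", "zh-Hant", "en"

@dataclass
class TextSection:
    label: str
    language: str
    editions: List[Edition] = field(default_factory=list)
    children: List["TextSection"] = field(default_factory=list)

work = TextSection(
    label="Example work",
    language="ar",
    children=[
        TextSection(
            label="Chapter 1",
            language="ar",
            editions=[
                Edition("https://example.org/editions/ar-1", "ar"),
                Edition("https://example.org/editions/en-1", "en"),
            ],
        )
    ],
)
print(work.children[0].editions[1].language)  # -> "en"
```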
[1] https://wiki.digitalclassicist.org/Canonical_Text_Services, accessed 27/03/2019
[2] https://github.com/distributed-text-services/specifications, accessed 27/03/2019
[3] https://rise.mpiwg-berlin.mpg.de/pages/doc_for_developers, accessed 27/03/2019
10:30 - 10:45 "Expectations and reality: developing an English-Japanese semantic web environment for the Late Hokusai research project"
Presenter: Stephanie Santschi (British Museum, Project “Late Hokusai: Thought, Technique, Society”, UK)
Abstract The AHRC-funded research project “Late Hokusai: Thought, Technique, Society”, which took place at the British Museum and SOAS, University of London between April 2016 and March 2019, investigated the later years of the Japanese artist Katsushika Hokusai (1760-1849). The project sought to understand how Hokusai’s context influenced his artworks and explored how digital methodology can support the understanding of implicit connections and trends between the artist and his society, faith and technological setting. Using semantic web tools, it therefore developed a knowledge graph prototype. The graph connects XML-derived bilingual source records from institutional participants with relevant data on people, events or spatio-temporal coordinates. Semantic relationships between these entities are documented as CIDOC CRM conceptual models that the research project refined as part of its activities. Issues arising when linking values from records in different source languages, and the demands of a multilingual user interface created from this source data, call into question the status quo of recording and processing multilingual information in museum repositories.
Reflecting on the learning curve of the research project, the paper revisits the implications of using Japanese-English collection databases as source records. It explains how semantic mapping challenges existing practices of noting and processing information; gives a rationale for prioritizing one language and script in a multilingual collaborative space like Hokusai ResearchSpace; and imagines how an ideal bilingual knowledge base for Hokusai research would function. By discussing the preconditions for researching and presenting content multilingually online, the paper aims to enrich a tacit knowledge base of NLS practitioners working with multilingual and multicultural environments.
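As a schematic illustration only (not the project’s actual graph or model), language-tagged RDF literals are one common way to keep English and Japanese labels on the same node so that a bilingual interface can choose which one to display; this assumes the rdflib package and uses a placeholder namespace.

```python
# Illustration only, not the Late Hokusai knowledge graph: language-tagged
# literals let one node carry both an English and a Japanese label.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDFS

EX = Namespace("https://example.org/hokusai/")  # placeholder namespace
g = Graph()
artist = EX.katsushika_hokusai
g.add((artist, RDFS.label, Literal("Katsushika Hokusai", lang="en")))
g.add((artist, RDFS.label, Literal("葛飾北斎", lang="ja")))

for label in g.objects(artist, RDFS.label):
    print(label, label.language)  # a bilingual UI can filter by language tag
```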
10:45 - 11:00 Coffee Break
11:00 - 12:05 NLS & DH: discussing platforms
11:00 - 11:25 “Towards a Versatile Open-Source Ecosystem for Computational Arabic Literary Studies”
Presenters: Mahmoud Kozae and Dr. Jan Jacob van Ginkel (Freie Universität Berlin, Department of History and Cultural Studies, ERC Advanced Grant: “Kalīla and Dimna – AnonymClassic”, Principal Investigator: Prof. Dr. Beatrice Gründler; Germany)
Abstract In designing the digital support for the project Kalīla and Dimna – AnonymClassic, we found that it has been relatively difficult for some literary researchers to acquire the technical skills to work with TEI/XML. To solve this problem, we decided to develop a system that has an easy user interface for advanced data entry and does not require knowledge of XML. The system involves mechanisms for validating the entered data and assuring its integrity and homogeneity. Furthermore, the data schema is designed to store the linguistic data in a format that will facilitate its processing by machine learning algorithms. Applying this process should help optimize automated tokenization and lemmatization procedures, which are critical issues in Arabic computational linguistics.
This system is realized in a web platform, which we are currently developing. The platform consists of a set of tools for administrating project tasks, text description and criticism, visual layout analysis of pages, transcription and linguistic annotation of texts, and literary segmentation and annotation of passages. All of these tools currently depend entirely on manual entry by an experienced user and do not yet involve automation. The code for the platform will be available in separate modules and packages with thoroughly documented APIs. We hope that our open-source standards and modules will eventually foster an ecosystem of tools in our field by being an attractive option for developers working in computational literary studies, whether in Arabic or in other languages.
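As a sketch of the general idea of schema-based validation at entry time (not the AnonymClassic platform’s actual schema; the field names are invented for illustration), a record can be checked against a declared schema before it is stored:

```python
# Illustration only: validating a structured entry against a schema instead of
# asking editors to hand-write XML. Field names here are hypothetical.
from jsonschema import ValidationError, validate

TOKEN_SCHEMA = {
    "type": "object",
    "properties": {
        "surface": {"type": "string", "minLength": 1},  # token as written
        "lemma": {"type": "string"},
        "witness": {"type": "string"},                   # manuscript siglum
    },
    "required": ["surface", "witness"],
    "additionalProperties": False,
}

entry = {"surface": "كليلة", "witness": "MS-Berlin"}
try:
    validate(instance=entry, schema=TOKEN_SCHEMA)
    print("entry accepted")
except ValidationError as err:
    print("entry rejected:", err.message)
```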
11:25 - 11:40 Curation Technologies for a Cultural Heritage Archive. Analysing and transforming the "Project Tongilbu" data set into an interactive curation workbench
Presenter: Peter Bourgonje (DFKI GmbH [German Research Center for Artificial Intelligence], Speech and Language Technology Lab, Germany)
Abstract We are developing a platform for generic curation technologies, using various NLP procedures, that is specifically targeted at, but not limited to, document collections that are too large for humans to (manually) read and go through. The aim then is to provide prototypical NLP tools like NER, Entity Linking, clustering and summarization in order to support rapid exploration of a data set.
In this particular submission, the data set in question is the result of "Project Tongilbu”, a report funded by the Korean Ministry of Re-unification on the unification of East and West Germany in the 1990s. The majority of the content in this data set is in German, with small parts in Korean. With the collection being a set of PDF files, we first apply OCR to extract machine-readable text. Focusing on German, we then apply an NER model trained on Wikipedia data, retrieve URIs of recognized entities in the GND (Gemeinsame Normdatei, a German authority database of entities with additional information), perform temporal analysis and cluster documents according to the retrieved entities they contain. This is then visualized in a curation dashboard.
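For orientation only (this is not DFKI’s curation pipeline), the German NER step of such a workflow could look roughly as follows with spaCy, assuming the de_core_news_sm model is installed; linking the recognized entities to GND records would be a separate lookup step.

```python
# Illustration of a German NER step, not the actual curation platform.
# Requires: pip install spacy && python -m spacy download de_core_news_sm
import spacy

nlp = spacy.load("de_core_news_sm")
doc = nlp("Helmut Kohl traf Michail Gorbatschow 1990 in Bonn.")

for ent in doc.ents:
    # PER/LOC/ORG entities could then be looked up in the GND.
    print(ent.text, ent.label_)
```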
Since support for Korean (in terms of tooling, but also training data) is limited, for the Korean texts we experiment with machine translation of the text extracted from the PDFs, then apply the German pipeline and project the annotations back onto the original Korean text.
11:40 - 12:05 "Creating, Linking, Visualizing and Interpreting Chinese and Korean datasets with MARKUS Environment"
Presenters: Jing Hu, Leiden University, The Netherlands; Ba-ro Kim, Chung-Ang University, South Korea
Abstract MARKUS (https://dh.chinese-empires.eu/markus/) is a multi-faceted research platform which allows researchers to automatically detect and manually correct personal names, place names, time references, official titles, and any other user-supplied named entities, and to export the results with links to integrated databases and toolkits for further analysis. It has been developed by modelling humanities research flows and allows researchers to switch between annotation, reading, exploration, analysis, and interpretation. MARKUS has played a pioneering role in non-European-language digital scholarship. The Korean version (K-MARKUS) is the first systematic attempt to adapt the model to another language.
This presentation will focus on how MARKUS has been used by researchers and students in Chinese and Korean Studies. HU Jing and KIM Ba-ro will also discuss the data model used to resolve the limitations of data integration in the development of K-MARKUS. In the first part, HU Jing will introduce the main functionality of MARKUS, including primary source text discovery and import from textual databases, the automated and manual mark-up of default named entities and user-generated tags, keyword discovery, batch mark-up, linked Chinese, Korean, and Manchu reference materials, data curation, content filtering, data export, as well as the associated textual analysis and data visualization platforms linked with MARKUS. In the second part, KIM Ba-ro will share his expertise in data creation for the K-MARKUS platform. He proposes an ontology model based on the concept of Linked Open Data to overcome the limitations on data integration caused by confidentiality, inaccuracy, and schema differences. Lastly, HU Jing will demonstrate a pilot study on a set of Korean records, the Yŏnhaengnok (Chosŏn travelogues to Beijing), using MARKUS and K-MARKUS in combination to show how the multilingual dimension of MARKUS benefits transnational studies.
12:05 - 12:35 NLS & DH: discussing standards & OCR
12:05 - 12:20 “Multilingual research projects: Challenges (and possible solutions) for making use of standards, authority files, and character recognition”
Presenter: Matthias Arnold (Universität Heidelberg, Cluster of Excellence “Asia and Europe in a Global Context”, Heidelberg Research Architecture, Germany)
Abstract This summer, the new Centre of Asian and Transcultural Studies (CATS) opens its doors at the University of Heidelberg.[1] This research collaboratorium also features a strong digital section, comprising research data in various media across Asia from both the digital library and the digital humanities research sides. However, providing data and metadata to a multilingual community is not always trivial. In my presentation I will take three use cases from CATS projects as examples of the challenges we face and introduce approaches to solving them.
Use case 1: Not all metadata standards are capable of encoding multilingual content in a sufficient way. Here I will take XML elements from the VRA Core 4 XML [2] metadata standard as examples, and present our extension of the standard [3, 4].
Use case 2: For Western-language source materials, for example newspapers after 1850, digitization also implies the production of an OCR version of the content (full text). Although results are not always perfect, funding agencies like the DFG have made this processing a mandatory step. For non-Latin script (NLS) material, this is not feasible: not only are the OCR algorithms not yet good enough and additional characters (for example emphasis marks) significantly disturb the processing, but often it is already the document layout recognition that fails. One example is the processing of Chinese newspapers from the first half of the 20th century [5], which will be used to illustrate the challenges. [6]
Use case 3: Connecting local databases to international authority files is a good practice for opening them up. It not only helps to precisely identify local entities, but also makes it possible to use external data to enhance the local resource. On the other hand, it allows external parties to re-use local data and opens the way to enhancing the authority databases with domain-specific data. Large international authorities, like the Getty Thesauri, or national authorities, like the German National Authority File (GND), tend to be less aware of non-Western items, such as concepts or agents. Projects that link their data systematically to these authority files can turn their local knowledge into an advantage and submit data back to the community. While contributing to the larger authorities may be a challenge in itself, a bottom-up workflow with community-based systems like Wikidata or DBpedia can be a more feasible first step. [7]
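As a small illustration of such a bottom-up workflow (not part of the ECPO toolchain), a local entity label can be matched against Wikidata’s public search API as a first linking step; the search term below is just an example.

```python
# Illustration only: look up a local entity name in Wikidata as a first step
# of a bottom-up authority-linking workflow.
import requests

def search_wikidata(term: str, language: str = "zh") -> list:
    response = requests.get(
        "https://www.wikidata.org/w/api.php",
        params={
            "action": "wbsearchentities",
            "search": term,
            "language": language,
            "format": "json",
        },
        timeout=10,
    )
    response.raise_for_status()
    return [(hit["id"], hit.get("label", "")) for hit in response.json()["search"]]

print(search_wikidata("申報"))  # e.g. candidate Wikidata items for the Shenbao newspaper
```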
[1] http://hra.uni-hd.de
[2] https://www.loc.gov/standards/vracore/schemas.html
[3] Application: VRA Core 4 XML Transform tool (.csv to XML): https://www.researchgate.net/project/VRA-Core-4-Transform-Tool
[4] Schema extension: http://cluster-schemas.uni-hd.de/vra-strictCluster.xsd
[5] Project database: https://uni-heidelberg.de/ecpo
[6] ECPO presentation: https://www.slideshare.net/MatthiasArnold/early-chinese-periodicals-online-ecpo-from-digitization-towards-open-data-jadh2018
[7] Agent service presentation: https://www.slideshare.net/MatthiasArnold/transforming-data-silos-into-knowledge-early-chinese-periodicals-online-ecpo
12:20 - 12:35 “No text - no mining. And what about dirty OCR? Training, optimizing, and testing of OCR/KWS-Methods for Chinese Scripts”
Presenter: Amir Moghaddass (Freie Universität Berlin, Campus Library, Project “Alt-Sinica”, Germany)
Abstract For DH practitioners in the Latin-script world, some of the questions raised in this paper might appear to have been solved long ago. For the non-Latin script (NLS) digital humanities, especially for libraries and other institutions engaged in NLS digitization projects, a seemingly simple question still poses a considerable challenge: how can we produce machine-readable texts that allow for the application of scientifically verifiable DH methods? The presentation will address this problem in relation to pre-modern Chinese in more detail. It will undertake a close examination of current digitization practices and their output to assess the scope, status and best practice of the methods applied.
The gap between the wealth of material from the premodern textual traditions of the non-Western world, especially by Chinese, Japanese, Arabic and Persian authors, on the one hand, and the small portion of texts from this body available in digital form on the other, has aptly been labeled the “Asia deficit within Digital Humanities” (Mullaney, 2016). While this gap is noticeably closing with the availability of major libraries’ digital collections of pre-modern Chinese texts, a severe quality problem is coming to light: the vast majority of OCRed NLS texts are of conspicuously low accuracy and, as a consequence, unusable for or even detrimental to valid computational analysis.
The paper will explore alternative approaches to OCR for NLS, text enrichment, and post-digitization correction to better facilitate scholarly access to digital historical texts.
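One common way to make “dirty OCR” measurable, independent of any particular project, is the character error rate (CER) against a hand-corrected ground truth; the sketch below uses an invented example sentence.

```python
# Illustration only: character error rate (CER) as a simple quality measure
# for OCR output against a hand-corrected reference transcription.
def character_error_rate(reference: str, hypothesis: str) -> float:
    # Levenshtein edit distance via dynamic programming.
    m, n = len(reference), len(hypothesis)
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i
    for j in range(n + 1):
        dist[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                             dist[i][j - 1] + 1,        # insertion
                             dist[i - 1][j - 1] + cost) # substitution
    return dist[m][n] / max(m, 1)

# Invented example: two characters misread out of seven -> CER of about 0.29.
print(character_error_rate("子曰學而時習之", "子日學而時習乏"))
```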
Mullaney, Thomas S. (2016). Digital Humanities Q and A with Thomas S. Mullaney. The China Policy Institute blog (University of Nottingham). http://theasiadialogue.com/2016/06/06/digital-humanities-qa-with-tom-mullaney (accessed 25 March 2019).
12:35 - 13:00 Discussion and wrap-up of the first part of the workshop
13:00 - 14:00 LUNCH
14:00 - 15:30 Hands-on Session
up to 9 stations: tools, platforms, workflows
15:30 - 15:45 Coffee Break
15:45 - 17:15 Working group session:
- Case for action: next steps and infrastructure for NLS
- Market place: definition and selection of topics, to be discussed by presenters and participants of the workshop