[Image credit for Proto-Sinaitic ‘alp 𐤀 used in logo: here (CC-BY-2.5, Author: Pmx)]
Second Workshop on Computation and Written Language (CAWL 2024)

To be held in conjunction with LREC-COLING 2024
Torino, Italy, May 21, 2024

Annual CAWL workshops are organized under the guidance of the newly formed ACL Special Interest Group on Writing Systems and Written Language (SIGWrit).
View the schedule and proceedings from the first CAWL workshop at ACL in Toronto, 2023.

Contact: cawl.workshop.2024@gmail.com

What's the workshop about?
Most work on NLP focuses on language in its canonical written form. This has often led researchers to ignore the differences between written and spoken language or, worse, to conflate the two. Instances of conflation are statements like “Chinese is a logographic language” or “Persian is a right-to-left language”, variants of which can be found frequently in the ACL Anthology. These statements confuse properties of the language with properties of its writing system. Ignoring differences between written and spoken language leads, among other things, to conflating different words that are spelled the same (e.g., English bass), or to treating a single word that has multiple spellings as several different words (e.g., Japanese umai ‘tasty’, which can be written 旨い, うまい, ウマい, or 美味い).

Furthermore, methods for dealing with written language issues (e.g., various kinds of normalization or conversion) or for recognizing text input (e.g., OCR and handwriting recognition, or text entry methods) are often regarded as precursors to NLP rather than as fundamental parts of the enterprise, despite the fact that most NLP methods rely centrally on representations derived from text rather than (spoken) language. This general lack of consideration of writing has led much of the research on such topics to appear largely outside of ACL venues, in conferences or journals of neighboring fields such as speech technology (e.g., text normalization) or human-computer interaction (e.g., text entry).

Original call-for-papers: https://www.aclweb.org/portal/content/call-papers-second-workshop-computation-and-written-language-cawl-2024

Invited Speaker: Nizar Habash (NYU Abu Dhabi)
Title of talk: On Writing Arabic
Abstract: The Arabic language, broadly defined, encompasses a diverse collection of varieties that are tied together historically and linguistically, but with a high degree of variation in terms of phonology, morphology, lexicon, and, naturally, orthography. In this talk we present a condensed summary of the challenges of writing Arabic and the evolution of different orthographic solutions to address them. The accumulation and persistence of different conventions have led to many co-existing orthographies today, creating a complex space of challenges for computational modeling. Among the examples we discuss are subtle differences in Standard Arabic spelling across Arab countries, using scripts other than Arabic for writing Arabic dialects, and, most recently, social media experimentation with reverting to ancient orthographic conventions to fight AI censorship algorithms.

Schedule

9:00-9:10 Organizers Opening remarks
9:10-10:10 Invited speaker: Nizar Habash On Writing Arabic
10:10-10:30 Rayyan Merchant & Kevin Tang ParsText: A Digraphic Corpus for Tajik-Farsi Transliteration
10:30-11:00 Coffee Break
11:00-11:30 Invited talk: Jalal Maleki Balancing Linguistic Integrity and Practicality: The Design Journey of Dabire, a Romanized Writing System for Persian
11:30-12:00 Wieke Harmsen, Catia Cucchiarini, Roeland van Hout & Helmer Strik A Joint Approach for Automatic Analysis of Reading and Writing Errors
12:00-12:20 Luna Peck & Susan Brown Tool for Constructing a Large-Scale Corpus of Code Comments and Other Source Code Annotations
12:20-2:00 Lunch break
2:00-2:30 Rastislav Hronsky & Emmanuel Keuleers Tokenization via Language Modeling: the Role of Preceding Text
2:30-2:50 Kyle Gorman & Brian Roark Abbreviation across the world's languages and scripts
2:50-3:20 Daan van Esch Now You See Me, Now You Don't: ‘Poverty of the Stimulus’ Problems and Arbitrary Correspondences in End-to-End Speech Models
3:20-3:40 Logan Born, M. Willis Monroe, Kathryn Kelley & Anoop Sarkar Towards Fast Cognate Alignment on Imbalanced Data
3:40-4:00 Organizers SIGWrit business meeting
4:00-4:30 Coffee Break
4:30-4:50 Yixia Wang & Emmanuel Keuleers Simplified Chinese Character Distance Based on Ideographic Description Sequences
4:50-5:00 Organizers Closing remarks, discussion

Organization

Organizing Committee:
Program Committee:


Sponsorship

The 2024 CAWL workshop is supported by Google.