Customers started requesting sentiment analysis and other Natural Language Processing (NLP) functionality for Dutch approximately two years ago. But back then, we were still quite busy developing our NLP functionality for German, Spanish, and other major languages, so it’s been on the back burner – till now.
As Dutch sits somewhat “in the middle” between English and German (both geographically and typologically), and since we have solid NLP functionality for those two languages, we realized that porting our German NLP system to Dutch would be a sensible option, with no need to start the Dutch development from scratch. We hired a Dutch contractor who, with the help of the “architect” of the German dependency-syntax and sentiment-extraction rules, was tasked with “Netherlandizing” that system.
Porting in Practice
So what needs to change in a largely rule-based NLP system when it is ported from German to Dutch? Well, the morphological analyzer and part-of-speech tagger need to be changed because the orthographies of the two languages differ markedly. And the lexicons need to be translated and complemented with sentiment terms specific to Dutch (e.g., the very frequent kanker, whose original meaning is “cancer,” but which is now used similarly to the English f-word).
We drew mostly on the German rule resources, leaving the English ones aside for the most part, and we ported both English and German lexical resources to Dutch and combined the results.
Adapting the rules for syntactic analysis to take into account the main differences between the languages was essential. Between German and Dutch, the starkest differences are probably:
- The role that grammatical case plays in determining grammatical relations
- The relative order of verb forms in complex verb clusters
Grammatical Case Plays a Considerably Smaller Role in Dutch
With regard to the former, Dutch is much closer to English than to German, i.e., instead of marking most noun phrases (NPs) for case, it does so almost exclusively for personal pronouns, and even there, it basically distinguishes only two grammatical cases (nominative and accusative/oblique) rather than four (nominative, genitive, dative, and accusative). As a result, NP-internal agreement plays a far more limited role in Dutch than in German, and the Dutch NP rules can be fewer in number and less complex than their German counterparts.
Conversely, however, we had to refine the rules that establish grammatical relations such as subject and object. Grammatical case can serve as evidence for a given relation only when pronouns are involved, so other criteria, such as linear order and semantic features like animacy, need to be exploited more heavily.
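To make the interplay of these criteria concrete, here is a minimal Python sketch of how subject and object might be assigned to the two arguments of a transitive verb. It is a toy illustration, not the actual rule system: the pronoun lists, the `NP` class, the `assign_relations` function, and the priority order (case first, then animacy, then linear order) are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Illustrative, non-exhaustive pronoun lists (assumption for this sketch).
# Ambiguous forms such as "ze", "je", "het", and "u" are deliberately left out.
NOMINATIVE = {"ik", "jij", "hij", "wij", "we", "zij"}
OBLIQUE = {"mij", "me", "jou", "hem", "haar", "ons", "hen", "hun"}


@dataclass
class NP:
    text: str
    position: int   # linear index of the phrase in the clause
    animate: bool   # assumed to be supplied by a lexicon


def pronoun_case(np: NP) -> Optional[str]:
    """Return 'nom' or 'obl' for unambiguous pronouns, otherwise None."""
    word = np.text.lower()
    if word in NOMINATIVE:
        return "nom"
    if word in OBLIQUE:
        return "obl"
    return None


def assign_relations(a: NP, b: NP) -> Tuple[str, str]:
    """Return (subject, object) texts for two argument NPs."""
    case_a, case_b = pronoun_case(a), pronoun_case(b)
    if case_a == "nom" or case_b == "obl":
        # 1. Case marking on a pronoun is the strongest evidence.
        subject, obj = a, b
    elif case_b == "nom" or case_a == "obl":
        subject, obj = b, a
    elif a.animate != b.animate:
        # 2. Without case evidence, prefer the animate NP as subject.
        subject, obj = (a, b) if a.animate else (b, a)
    else:
        # 3. Last resort: the NP that comes first is the subject.
        subject, obj = (a, b) if a.position < b.position else (b, a)
    return subject.text, obj.text
```

With a fronted object as in "Die film haat ik" ("That movie, I hate"), the nominative pronoun settles the matter; with two full NPs, the sketch falls back on animacy and then on word order.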
Example sentence (1) illustrates a typical Dutch sentence expressing sentiment. It also shows that Dutch is structurally extremely similar to German, exhibiting verb-final constituent order in subclauses like Als ze willen skypen and verb-second order in the main clause.
The only way in which the word order in this Dutch sentence differs from its German translation is the relative order in the verb cluster willen skypen, which would be skypen wollen in German. The syntactic analysis, as well as the positive sentiment frame that we extract from it, are shown in the graphic below.
|“If they want to skype, I will definitely recommend an iPhone.”|
Interestingly, we also came across a phenomenon in Dutch that occurs much less frequently in German and English and hence had not yet received a systematic treatment in the respective grammars, namely, split prepositional phrases (PPs) consisting of a fronted daar/er (“there”), hier (“here”), or waar (“where”) and a “dangling” preposition. Example (2) illustrates this phenomenon.
Since we encountered this kind of PP a number of times in just a couple of hundred sentences from Dutch social media posts, we decided to come up with a sound syntactic analysis and thereby allow for the extraction of the correct sentiment for the right target from sentences like (2).
As above, the following graphic shows the enriched dependency graph that our parser assigns to that sentence as well as the positive sentiment frame extracted from that graph.
|“I have now had, for a couple of years already, a Sennheiser HD650, which I am very satisfied with.”|
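The core idea behind detecting such split PPs can be sketched in a few lines of Python. This is only a rough approximation under stated assumptions: the input is a clause that has already been tagged with a toy part-of-speech tag set, the function name `find_split_pp` is hypothetical, and a preposition counts as “dangling” simply when no nominal complement follows it. A real parser rule would work on syntactic structure rather than on flat token sequences.

```python
from typing import List, Optional, Tuple

# Fronted elements that can form a split PP with a later preposition.
R_PRONOUNS = {"daar", "er", "hier", "waar"}

# Tags after which a preposition is taken to have no complement of its own
# (toy tag set, assumed for this sketch).
NON_COMPLEMENT_TAGS = {"VERB", "AUX", "PUNCT"}


def find_split_pp(tagged: List[Tuple[str, str]]) -> Optional[Tuple[int, int]]:
    """Return (index of daar/er/hier/waar, index of dangling preposition).

    `tagged` is a list of (word, pos) pairs for a single clause.
    Returns None if no split PP is found.
    """
    for i, (word, _) in enumerate(tagged):
        if word.lower() not in R_PRONOUNS:
            continue
        for j in range(i + 1, len(tagged)):
            if tagged[j][1] != "ADP":
                continue
            next_tag = tagged[j + 1][1] if j + 1 < len(tagged) else None
            # The preposition is dangling if the clause ends here or a verb
            # or punctuation mark follows instead of a noun phrase.
            if next_tag is None or next_tag in NON_COMPLEMENT_TAGS:
                return i, j
    return None


# Illustrative relative clause (constructed for this sketch):
# "waar ik erg tevreden over ben" = "which I am very satisfied with"
clause = [("waar", "ADV"), ("ik", "PRON"), ("erg", "ADV"),
          ("tevreden", "ADJ"), ("over", "ADP"), ("ben", "AUX")]
print(find_split_pp(clause))
```

Once the two halves are linked, the fronted element can be resolved to its antecedent, which is what allows the sentiment to be attached to the right target.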
Developing New Rules to Fit Dutch
Finally, some work was, of course, necessary in the sentiment-extraction rule system as well. For example, it seems that Dutch uses more expressions involving hebben (“to have”) and some nouns to express sentiment than German. In order not to miss these expressions of sentiment, we thus had to develop new rules for sentences like (3), for which we extract hekel hebben aan (“dislike”/“abominate”) as the Sentiment role. (The graph below can only show one chunk per role, but internally all the relevant pieces are combined into hekel hebben aan.)
|I|have|such an|abomination|at|One Direction|
|“I so abominate One Direction.”|
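A rule of this kind can be approximated with a small lookup table of noun, verb, and preposition triples. The Python sketch below is an illustration under assumptions, not the production rule: the table `LIGHT_VERB_EXPRESSIONS`, the function `extract_frame`, and the flat (word, lemma) input are all hypothetical, and the sample sentence is reconstructed from the gloss above rather than taken from the original example.

```python
from typing import Dict, List, Optional, Tuple

# (noun, verb lemma, preposition) -> polarity. Illustrative entry only.
LIGHT_VERB_EXPRESSIONS: Dict[Tuple[str, str, str], str] = {
    ("hekel", "hebben", "aan"): "negative",
}


def extract_frame(tokens: List[Tuple[str, str]]) -> Optional[Dict[str, str]]:
    """Extract a sentiment frame from a clause given as (word, lemma) pairs."""
    lemmas = [lemma for _, lemma in tokens]
    for (noun, verb, prep), polarity in LIGHT_VERB_EXPRESSIONS.items():
        if verb not in lemmas or noun not in lemmas:
            continue
        noun_idx = lemmas.index(noun)
        # The preposition has to follow the sentiment noun.
        if prep not in lemmas[noun_idx + 1:]:
            continue
        prep_idx = lemmas.index(prep, noun_idx + 1)
        # Simplification: everything after the preposition is the target.
        target = " ".join(word for word, _ in tokens[prep_idx + 1:])
        return {
            # All pieces are combined into one Sentiment role.
            "sentiment": f"{noun} {verb} {prep}",
            "polarity": polarity,
            "target": target,
        }
    return None


# Sentence reconstructed from the gloss (an assumption, see above).
sentence = [("Ik", "ik"), ("heb", "hebben"), ("zo'n", "zo'n"),
            ("hekel", "hekel"), ("aan", "aan"),
            ("One", "One"), ("Direction", "Direction")]
print(extract_frame(sentence))
```

Keying the table on the full triple rather than on hebben alone keeps ordinary possessive uses of the verb, as in "Ik heb een iPhone", from triggering a sentiment frame.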
Similarly, we had to take care of THE most prototypical expression of sentiment in the Dutch language, namely houden van (“to love”). Since houden by itself means “to hold/keep/retain,” we had to write a special rule that would make sure that (a) the sentiment is extracted only when houden is combined with a van-PP and (b) the extracted Sentiment role is houden van rather than just houden. Example (4) is a Dutch sentence causing that rule to fire, as the following graphic illustrates.
|“How I love Metallica on days like these.”|
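The two conditions, (a) and (b), map naturally onto a check over a dependency tree. Here is a minimal Python sketch, assuming a toy `Node` class and made-up dependency labels ("subj", "obj", "pp") that are not the labels of the actual parser; the function name `houden_van_frame` and the test sentences are likewise constructed for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    word: str
    lemma: str
    relation: str = "root"                      # toy dependency label
    children: List["Node"] = field(default_factory=list)


def houden_van_frame(verb: Node) -> Optional[Dict[str, str]]:
    """Return a positive sentiment frame only for houden + van-PP."""
    if verb.lemma != "houden":
        return None
    for child in verb.children:
        # (a) Fire only if houden governs a PP headed by "van".
        if child.relation == "pp" and child.lemma == "van":
            target = " ".join(node.word for node in child.children)
            # (b) The Sentiment role is the whole expression, not bare houden.
            return {"sentiment": "houden van",
                    "polarity": "positive",
                    "target": target}
    return None


# "Ik hou van Metallica" ("I love Metallica")
love = Node("hou", "houden", children=[
    Node("Ik", "ik", "subj"),
    Node("van", "van", "pp", [Node("Metallica", "Metallica", "obj")]),
])

# "Ik hou het geld" ("I keep the money"): no van-PP, so no sentiment.
keep = Node("hou", "houden", children=[
    Node("Ik", "ik", "subj"),
    Node("geld", "geld", "obj"),
])

print(houden_van_frame(love))
print(houden_van_frame(keep))
```

Checking the dependency relation rather than the mere presence of the word van in the sentence is what keeps unrelated van-phrases elsewhere in the clause from triggering the rule.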
Porting a Parse-based System Beats Starting One from Scratch
All in all, it has become clear that porting a parse-based sentiment analysis system to a new language considerably reduces the necessary effort as compared to starting a system from scratch.
Dutch and German may be a particularly easy pair due to the close similarity of the languages, but similar efforts may well work among Romance (Spanish and Catalan, for example), Slavic, or Scandinavian languages. And, as an added benefit, the process also subjects the “parent” language to scrutiny from a new angle, as certain constructions expressing sentiment might be less frequent in it than in the “derived” language, but nevertheless occur in our social media data, even if, so far, they have gone unnoticed.
Hence, using this sped-up development process not only provides benefits for the “derived” language, but also offers potential for bidirectional cross-fertilization.
About the Author: Martin Forst is a Senior Computational Linguist at NetBase Solutions with both academic and industrial experience in the development of industry-strength computational grammars for several languages and the deployment of these grammars in linguistic applications.