Language detection module

Jul 5th, 2000 10:01
Nathan Wallace, Hans Nowak, Snippet 192, Dinu C. Gherman
"""
Packages: text;miscellaneous.mundane
"""
"""
Is there already a function to which I can pass an arbitrary
string and that will tell me whether it is written in English,
French, German, etc.?
I imagine this could be implemented rather simply with some
dicts containing common prefixes and suffixes (as well as the
most often used words like 'you', 'me', etc.) of the respective
natural language. One could then calculate a likelihood for the
text to be in each of these language classes and return the
most likely one, or a list of them. I'm not sure, though, how
accents would be represented in a "portable" way (across
multiple platforms), maybe in HTML...?
While writing this I started a small experiment to code up what
I have in mind. Below you'll find what came out of it. Is there
anything more sophisticated out there, including a better
scoring/weighting method, maybe also for combinations of words,
perhaps even handling accents?
"""
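On the accent question above: one portable option in present-day Python is to keep the text as Unicode and, if the word lists are accent-free, fold accented letters to their base letters before the lookup. The sketch below assumes Python 3 and the standard unicodedata module. The helper name stripAccents is hypothetical and not part of the original snippet.

```python
import unicodedata

def stripAccents(text):
    # NFKD splits an accented letter into base letter + combining mark,
    # e.g. 'e' with acute becomes 'e' followed by U+0301.
    decomposed = unicodedata.normalize('NFKD', text)
    # Keep everything except the combining marks.
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

print(stripAccents('Où est la bibliothèque?'))
```

Folding discards information that may itself be a hint (an 'é' is more likely French than English), so a detector might prefer to keep the accents and list accented words directly.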
# langdetect.py -- Detect the natural language of a written text.
# Runs under Python 3.
en, fr, de = 'en', 'fr', 'de'
# Frequent whole words, mapped to their language.
wordDict = {
    'i':en, 'you':en, 'me':en, 'the':en, 'a':en,
    'moi':fr, 'je':fr, 'toi':fr, 'vous':fr, 'sur':fr, 'en':fr,
    'sie':de, 'ich':de, 'um':de, 'an':de, 'ab':de}
# Typical word beginnings.
prefixDict = {
    'off':en, 'to':en, 'under':en, 'in':en, 'thou':en,
    'mont':fr, 'contr':fr, 'mal':fr,
    'ver':de, 'zu':de, 'los':de, 'gut':de}
# Typical word endings.
suffixDict = {
    'son':en, 'day':en, 'ing':en, 'ly':en, 'ght':en,
    'ique':fr, 'tude':fr, 'ont':fr, 'nal':fr,
    'tung':de, 'heim':de, 'zeug':de}
# Punctuation is replaced by spaces before the text is split into words.
punct = """.,!?"()[]{}§$%&/*+#"""
trans = str.maketrans(punct, ' '*len(punct))
def detectLanguage(text):
    # Returns (hit counts per language, matching words per language).
    words = text.lower().translate(trans).split()
    res = {en:0, fr:0, de:0}
    explain = {en:[], fr:[], de:[]}
    for word in words:
        # Whole-word match.
        if word in wordDict:
            v = wordDict[word]
            res[v] = res[v] + 1
            explain[v].append(word)
        # Prefix match: credit the language of the prefix itself,
        # not that of an earlier whole-word hit.
        for p, v in prefixDict.items():
            if word.startswith(p):
                res[v] = res[v] + 1
                explain[v].append(word)
        # Suffix match, handled the same way.
        for s, v in suffixDict.items():
            if word.endswith(s):
                res[v] = res[v] + 1
                explain[v].append(word)
    return res, explain
for phrase in ("I am in a good mood today.",
        "Je suis en pleine forme.",
        "Ich bin heute gut drauf."):
    result, explain = detectLanguage(phrase)
    print("Input:", phrase)
    print("Hypothesis:", result)
    print("Reasons:", explain)
    print()
# Should print something like this:
#
# Input: I am in a good mood today.
# Hypothesis: {'en': 5, 'fr': 0, 'de': 0}
# Reasons: {'en': ['i', 'in', 'a', 'today', 'today'], 
#           'fr': [], 
#           'de': []}
#
# Input: Je suis en pleine forme.
# Hypothesis: {'en': 0, 'fr': 2, 'de': 0}
# Reasons: {'en': [], 'fr': ['je', 'en'], 'de': []}
#
# Input: Ich bin heute gut drauf.
# Hypothesis: {'en': 0, 'fr': 0, 'de': 2}
# Reasons: {'en': [], 'fr': [], 'de': ['ich', 'gut']}
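
The question also asks for "the most likely one, or a list of them". A minimal sketch of that last step, assuming Python 3 and a score dict shaped like the Hypothesis output above. The helper name rankLanguages is hypothetical, and turning counts into shares of all hits is only one of several possible scoring choices.

```python
def rankLanguages(scores):
    # scores: hit counts per language, e.g. {'en': 5, 'fr': 0, 'de': 0}
    total = sum(scores.values())
    if total == 0:
        return []  # no evidence for any language
    # Highest count first; ties keep their original order.
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    # Report each language with its share of all hits.
    return [(lang, count / total) for lang, count in ranked]

print(rankLanguages({'en': 1, 'fr': 3, 'de': 0}))
```

The first entry of the returned list is the best guess. An empty list means that no word, prefix or suffix matched at all.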