Find all Chinese text in a string using Python and Regex
I needed to strip the Chinese out of a bunch of strings today and was looking for a simple Python regex. Any suggestions?
The short but relatively comprehensive answer for narrow Unicode builds of Python (excluding ordinals > 65535, which can only be represented in narrow Unicode builds via surrogate pairs):
    import re

    # mystring is the unicode string to clean
    RE = re.compile(u'[⺀-⺙⺛-⻳⼀-⿕々〇〡-〩〸-〺〻㐀-䶵一-鿃豈-鶴侮-頻並-龎]', re.UNICODE)
    nochinese = RE.sub('', mystring)
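Since the title asks about finding Chinese text and the question about stripping it, here is a minimal sketch of both with the same BMP-only character class. It assumes Python 3.3 or later, where every `str` is Unicode and the narrow/wide build distinction no longer applies. The names `HAN_BMP`, `find_chinese` and `strip_chinese` are made up for illustration, and the ranges are written as `\u` escapes so the source file encoding does not matter.

```python
import re

# Same BMP ranges as the character class above, written as \u escapes.
# The trailing "+" makes each match a whole run of Han characters.
HAN_BMP = re.compile(
    "[\u2e80-\u2e99\u2e9b-\u2ef3\u2f00-\u2fd5\u3005\u3007\u3021-\u3029"
    "\u3038-\u303a\u303b\u3400-\u4db5\u4e00-\u9fc3\uf900-\ufa2d"
    "\ufa30-\ufa6a\ufa70-\ufad9]+"
)


def find_chinese(text):
    """Return every contiguous run of Han characters in text."""
    return HAN_BMP.findall(text)


def strip_chinese(text):
    """Return text with all Han characters removed."""
    return HAN_BMP.sub("", text)


print(find_chinese("abc 美国 def 中文"))  # ['美国', '中文']
print(strip_chinese("abc 美国 def"))      # the two spaces around the run remain
```

Stripping leaves the surrounding whitespace untouched, so collapsing doubled spaces afterwards is a separate step if you need it.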
Here is the code for building the RE, which you also need if you want to detect Chinese characters in the supplementary planes on wide builds:
    # -*- coding: utf-8 -*-
    # Python 2: uses unichr() and the print statement.
    import re

    LHan = [[0x2E80, 0x2E99],    # Han # So  [26] CJK RADICAL REPEAT..CJK RADICAL RAP
            [0x2E9B, 0x2EF3],    # Han # So  [89] CJK RADICAL CHOKE..CJK RADICAL C-SIMPLIFIED TURTLE
            [0x2F00, 0x2FD5],    # Han # So [214] KANGXI RADICAL ONE..KANGXI RADICAL FLUTE
            0x3005,              # Han # Lm       IDEOGRAPHIC ITERATION MARK
            0x3007,              # Han # Nl       IDEOGRAPHIC NUMBER ZERO
            [0x3021, 0x3029],    # Han # Nl   [9] HANGZHOU NUMERAL ONE..HANGZHOU NUMERAL NINE
            [0x3038, 0x303A],    # Han # Nl   [3] HANGZHOU NUMERAL TEN..HANGZHOU NUMERAL THIRTY
            0x303B,              # Han # Lm       VERTICAL IDEOGRAPHIC ITERATION MARK
            [0x3400, 0x4DB5],    # Han # Lo [6582] CJK UNIFIED IDEOGRAPH-3400..CJK UNIFIED IDEOGRAPH-4DB5
            [0x4E00, 0x9FC3],    # Han # Lo [20932] CJK UNIFIED IDEOGRAPH-4E00..CJK UNIFIED IDEOGRAPH-9FC3
            [0xF900, 0xFA2D],    # Han # Lo [302] CJK COMPATIBILITY IDEOGRAPH-F900..CJK COMPATIBILITY IDEOGRAPH-FA2D
            [0xFA30, 0xFA6A],    # Han # Lo  [59] CJK COMPATIBILITY IDEOGRAPH-FA30..CJK COMPATIBILITY IDEOGRAPH-FA6A
            [0xFA70, 0xFAD9],    # Han # Lo [106] CJK COMPATIBILITY IDEOGRAPH-FA70..CJK COMPATIBILITY IDEOGRAPH-FAD9
            [0x20000, 0x2A6D6],  # Han # Lo [42711] CJK UNIFIED IDEOGRAPH-20000..CJK UNIFIED IDEOGRAPH-2A6D6
            [0x2F800, 0x2FA1D]]  # Han # Lo [542] CJK COMPATIBILITY IDEOGRAPH-2F800..CJK COMPATIBILITY IDEOGRAPH-2FA1D

    def build_re():
        # Collect the pieces of a character class: "a-b" for ranges, "c" for single code points.
        L = []
        for i in LHan:
            if isinstance(i, list):
                f, t = i
                try:
                    f = unichr(f)
                    t = unichr(t)
                    L.append('%s-%s' % (f, t))
                except ValueError:
                    pass  # A narrow Python build, so can't use chars > 65535 without surrogate pairs!
            else:
                try:
                    L.append(unichr(i))
                except ValueError:
                    pass

        RE = '[%s]' % ''.join(L)
        print 'RE:', RE.encode('utf-8')
        return re.compile(RE, re.UNICODE)

    RE = build_re()
    print RE.sub('', u'美国').encode('utf-8')
    print RE.sub('', u'blah').encode('utf-8')
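For readers on Python 3, the same idea can be sketched without the narrow-build workaround. This assumes Python 3.3 or later, where `chr()` accepts any code point up to 0x10FFFF, so the supplementary-plane ranges are always included. The names `HAN_RANGES` and `build_han_re` are hypothetical. The ranges are copied from the table above and have not been updated for later Unicode versions.

```python
import re

# (first, last) code point pairs copied from the table above;
# single code points repeat the same value twice.
HAN_RANGES = [
    (0x2E80, 0x2E99), (0x2E9B, 0x2EF3), (0x2F00, 0x2FD5),
    (0x3005, 0x3005), (0x3007, 0x3007), (0x3021, 0x3029),
    (0x3038, 0x303A), (0x303B, 0x303B), (0x3400, 0x4DB5),
    (0x4E00, 0x9FC3), (0xF900, 0xFA2D), (0xFA30, 0xFA6A),
    (0xFA70, 0xFAD9), (0x20000, 0x2A6D6), (0x2F800, 0x2FA1D),
]


def build_han_re(ranges=HAN_RANGES):
    """Compile a character class that matches one character in any of the ranges."""
    parts = []
    for first, last in ranges:
        if first == last:
            parts.append(chr(first))
        else:
            parts.append("%s-%s" % (chr(first), chr(last)))
    return re.compile("[%s]" % "".join(parts))


han_re = build_han_re()
print(han_re.sub("", "美国 blah"))       # BMP characters are removed
print(han_re.sub("", "\U00020000 blah"))  # so is a supplementary-plane ideograph
```

No `try`/`except` is needed here because `chr()` does not raise for code points above 0xFFFF on these Python versions.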