Find all Chinese text in a string using Python and Regex


I needed to strip the Chinese out of a bunch of strings today and was looking for a simple Python regex to do it. Any suggestions?

The short, but relatively comprehensive, answer for narrow Unicode builds of Python (it excludes ordinals > 65535, which can only be represented in narrow Unicode builds via surrogate pairs):

    import re

    RE = re.compile(u'[⺀-⺙⺛-⻳⼀-⿕々〇〡-〩〸-〺〻㐀-䶵一-鿃豈-鶴侮-頻並-龎]', re.UNICODE)
    nochinese = RE.sub('', mystring)
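To make the surrogate-pair remark concrete, here is a rough sketch of how a code point above 65535 gets split into two 16-bit units, which is why a narrow build cannot treat it as a single character in a character class. The helper name `to_surrogate_pair` is my own, for illustration only, and the snippet is written for Python 3.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary-plane code point into UTF-16 (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError('not a supplementary-plane code point')
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)     # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits of the offset
    return high, low

# U+20000 is the first CJK Unified Ideograph in the supplementary plane.
high, low = to_surrogate_pair(0x20000)
print(hex(high), hex(low))  # 0xd840 0xdc00
```

On a narrow build the regex engine sees those two units separately, so a range such as U+20000 to U+2A6D6 cannot be written directly inside `[...]`.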

The code for building the RE, which you will need if you want to detect Chinese characters in the supplementary plane on wide builds:

    # -*- coding: utf-8 -*-
    # Python 2 code (uses unichr and the print statement).
    import re

    LHan = [[0x2E80, 0x2E99],    # Han # So  [26] CJK RADICAL REPEAT, CJK RADICAL RAP
            [0x2E9B, 0x2EF3],    # Han # So  [89] CJK RADICAL CHOKE, CJK RADICAL C-SIMPLIFIED TURTLE
            [0x2F00, 0x2FD5],    # Han # So [214] KANGXI RADICAL ONE, KANGXI RADICAL FLUTE
            0x3005,              # Han # Lm       IDEOGRAPHIC ITERATION MARK
            0x3007,              # Han # Nl       IDEOGRAPHIC NUMBER ZERO
            [0x3021, 0x3029],    # Han # Nl   [9] HANGZHOU NUMERAL ONE, HANGZHOU NUMERAL NINE
            [0x3038, 0x303A],    # Han # Nl   [3] HANGZHOU NUMERAL TEN, HANGZHOU NUMERAL THIRTY
            0x303B,              # Han # Lm       VERTICAL IDEOGRAPHIC ITERATION MARK
            [0x3400, 0x4DB5],    # Han # Lo [6582] CJK UNIFIED IDEOGRAPH-3400, CJK UNIFIED IDEOGRAPH-4DB5
            [0x4E00, 0x9FC3],    # Han # Lo [20932] CJK UNIFIED IDEOGRAPH-4E00, CJK UNIFIED IDEOGRAPH-9FC3
            [0xF900, 0xFA2D],    # Han # Lo [302] CJK COMPATIBILITY IDEOGRAPH-F900, CJK COMPATIBILITY IDEOGRAPH-FA2D
            [0xFA30, 0xFA6A],    # Han # Lo  [59] CJK COMPATIBILITY IDEOGRAPH-FA30, CJK COMPATIBILITY IDEOGRAPH-FA6A
            [0xFA70, 0xFAD9],    # Han # Lo [106] CJK COMPATIBILITY IDEOGRAPH-FA70, CJK COMPATIBILITY IDEOGRAPH-FAD9
            [0x20000, 0x2A6D6],  # Han # Lo [42711] CJK UNIFIED IDEOGRAPH-20000, CJK UNIFIED IDEOGRAPH-2A6D6
            [0x2F800, 0x2FA1D]]  # Han # Lo [542] CJK COMPATIBILITY IDEOGRAPH-2F800, CJK COMPATIBILITY IDEOGRAPH-2FA1D

    def build_re():
        L = []
        for i in LHan:
            if isinstance(i, list):
                # A [first, last] pair becomes a "first-last" range in the class.
                f, t = i
                try:
                    f = unichr(f)
                    t = unichr(t)
                    L.append('%s-%s' % (f, t))
                except ValueError:
                    pass  # A narrow Python build, so can't use chars > 65535 without surrogate pairs!
            else:
                # A single code point is added to the class as-is.
                try:
                    L.append(unichr(i))
                except ValueError:
                    pass

        RE = '[%s]' % ''.join(L)
        print 'RE:', RE.encode('utf-8')
        return re.compile(RE, re.UNICODE)

    RE = build_re()
    print RE.sub('', u'美国').encode('utf-8')
    print RE.sub('', u'blah').encode('utf-8')
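For what it's worth, Python 3.3 and later no longer have the narrow/wide build distinction, so `chr()` accepts every code point and the try/except dance isn't needed there. Below is a minimal sketch of the same idea for Python 3, using the same ranges as the list above. The names `HAN_RANGES`, `build_han_re` and `HAN_RE` are my own and not part of the original answer, and single code points are written as one-element ranges to keep the loop simple.

```python
import re

# Same ranges as the list above, as (first, last) code point pairs.
HAN_RANGES = [
    (0x2E80, 0x2E99), (0x2E9B, 0x2EF3), (0x2F00, 0x2FD5),
    (0x3005, 0x3005), (0x3007, 0x3007), (0x3021, 0x3029),
    (0x3038, 0x303A), (0x303B, 0x303B), (0x3400, 0x4DB5),
    (0x4E00, 0x9FC3), (0xF900, 0xFA2D), (0xFA30, 0xFA6A),
    (0xFA70, 0xFAD9), (0x20000, 0x2A6D6), (0x2F800, 0x2FA1D),
]

def build_han_re(ranges=HAN_RANGES):
    """Compile a character class covering every (first, last) range."""
    parts = []
    for first, last in ranges:
        if first == last:
            parts.append(chr(first))
        else:
            parts.append('%s-%s' % (chr(first), chr(last)))
    return re.compile('[%s]' % ''.join(parts))

HAN_RE = build_han_re()
print(HAN_RE.sub('', '美国 blah'))      # prints " blah"
print(HAN_RE.sub('', '\U00020000x'))    # supplementary-plane character is stripped too
```

Note that the table reflects the Han ranges quoted in the answer; later Unicode versions have added more ideographs, so treat the list as a starting point rather than a complete one.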
