python - How to efficiently parse fixed width files?
I am trying to find an efficient way of parsing files that hold fixed-width lines. For example, the first 20 characters represent a column, characters 21 to 30 another one, and so on.
Assuming that a line holds 100 characters, what would be an efficient way to parse it into several components?
I could use string slicing per line, but it's a little bit ugly if the line is big. Are there any other fast methods?
Using the Python standard library's struct module would be fairly easy as well as extremely fast, since it's written in C.
Here's how it could be used to do what you want. It also allows columns of characters to be skipped by specifying negative values for the number of characters in the field.
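    import struct

    fieldwidths = (2, -10, 24)  # negative widths represent ignored padding fields
    # Build a struct format such as '2s 10x 24s': 's' keeps bytes, 'x' skips them.
    fmtstring = ' '.join('{}{}'.format(abs(fw), 'x' if fw < 0 else 's')
                         for fw in fieldwidths)
    fieldstruct = struct.Struct(fmtstring)
    parse = fieldstruct.unpack_from
    print('fmtstring: {!r}, recsize: {} chars'.format(fmtstring, fieldstruct.size))

    # Python 2: a str is already a byte string. Under Python 3 the line
    # must be a bytes object (e.g. b'...') for unpack_from to accept it.
    line = 'abcdefghijklmnopqrstuvwxyz0123456789\n'
    fields = parse(line)
    print('fields: {}'.format(fields))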
Output:
    fmtstring: '2s 10x 24s', recsize: 36 chars
    fields: ('ab', 'mnopqrstuvwxyz0123456789')
The following modifications would adapt it to work in Python 2 or 3 (and handle Unicode input):
    import sys

    fieldstruct = struct.Struct(fmtstring)
    if sys.version_info[0] < 3:
        parse = fieldstruct.unpack_from
    else:
        # converts unicode input to a byte string and the results back to unicode strings
        unpack = fieldstruct.unpack_from
        parse = lambda line: tuple(s.decode() for s in unpack(line.encode()))
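One thing to keep in mind with the struct approach is that the widths are counted in bytes, not characters, so the encode/decode round trip only lines up when every character encodes to one byte. As a rough Python 3 sketch of applying it to a whole file, the generator below assumes a single-byte encoding (latin-1 here) and skips lines that are too short instead of raising struct.error. The names parse_lines, FIELDWIDTHS and RECORD, and the sample data, are made up for illustration.

```python
import io
import struct

# Hypothetical layout for illustration: 2-char field, 10 ignored chars, 24-char field.
FIELDWIDTHS = (2, -10, 24)
FMT = ' '.join('{}{}'.format(abs(fw), 'x' if fw < 0 else 's') for fw in FIELDWIDTHS)
RECORD = struct.Struct(FMT)


def parse_lines(stream, encoding='latin-1'):
    """Yield a tuple of str fields for each sufficiently long line of a text stream.

    Assumes a single-byte encoding so that one character is one byte;
    struct counts bytes, not characters.
    """
    for line in stream:
        raw = line.encode(encoding)
        if len(raw) < RECORD.size:
            continue  # skip short lines rather than let unpack_from raise
        yield tuple(f.decode(encoding) for f in RECORD.unpack_from(raw))


# io.StringIO stands in for an open text file here.
sample = io.StringIO(
    'abcdefghijklmnopqrstuvwxyz0123456789\n'
    'too short\n'
    'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789\n'
)
rows = list(parse_lines(sample))
print(rows)
```

If the data can contain multi-byte characters (UTF-8 text with accents, for instance), the slicing approach is probably the safer choice, since it counts characters.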
Here's a way to do it with string slices, which you were considering but were concerned might get too ugly. The nice thing about it, besides not being all that ugly, is that it works unchanged in both Python 2 and 3, as well as being able to handle Unicode strings. I haven't benchmarked it, but suspect it might be competitive with the struct module version speedwise. It could be sped up slightly by removing the ability to have padding fields.
    try:
        from itertools import izip_longest  # added in Py 2.6
    except ImportError:
        from itertools import zip_longest as izip_longest  # name change in Py 3.x

    try:
        from itertools import accumulate  # added in Py 3.2
    except ImportError:
        def accumulate(iterable):
            'Return running totals (simplified version).'
            iterable = iter(iterable)  # accept any iterable, not only iterators
            total = next(iterable)
            yield total
            for value in iterable:
                total += value
                yield total

    def make_parser(fieldwidths):
        # cumulative end offset of every field, padding included
        cuts = tuple(cut for cut in accumulate(abs(fw) for fw in fieldwidths))
        pads = tuple(fw < 0 for fw in fieldwidths)  # bool values for padding fields
        flds = tuple(izip_longest(pads, (0,)+cuts, cuts))[:-1]  # ignore final one
        parse = lambda line: tuple(line[i:j] for pad, i, j in flds if not pad)
        # optional informational function attributes
        parse.size = sum(abs(fw) for fw in fieldwidths)
        parse.fmtstring = ' '.join('{}{}'.format(abs(fw), 'x' if fw < 0 else 's')
                                   for fw in fieldwidths)
        return parse

    line = 'abcdefghijklmnopqrstuvwxyz0123456789\n'
    fieldwidths = (2, -10, 24)  # negative widths represent ignored padding fields
    parse = make_parser(fieldwidths)
    fields = parse(line)
    print('format: {!r}, rec size: {} chars'.format(parse.fmtstring, parse.size))
    print('fields: {}'.format(fields))
Output:
    format: '2s 10x 24s', rec size: 36 chars
    fields: ('ab', 'mnopqrstuvwxyz0123456789')
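For what it's worth, here is a rough Python 3 sketch of what the stripped-down variant without padding fields might look like, along with one way to time it against struct using timeit. The name make_simple_parser is made up for this example, and the numbers will vary by machine and Python version, so treat them as a rough indication only.

```python
import struct
import timeit
from itertools import accumulate  # Python 3.2+


def make_simple_parser(fieldwidths):
    """Build a slicing parser for positive widths only (no padding fields)."""
    if any(fw <= 0 for fw in fieldwidths):
        raise ValueError('all field widths must be positive')
    cuts = list(accumulate(fieldwidths))       # end offset of each field
    spans = tuple(zip([0] + cuts[:-1], cuts))  # (start, end) pairs
    return lambda line: tuple(line[i:j] for i, j in spans)


line = 'abcdefghijklmnopqrstuvwxyz0123456789\n'
parse_simple = make_simple_parser((2, 10, 24))
simple_fields = parse_simple(line)
print(simple_fields)

# Equivalent struct parser for comparison; it needs bytes under Python 3.
unpack = struct.Struct('2s 10s 24s').unpack_from
raw = line.encode('ascii')
struct_fields = tuple(s.decode('ascii') for s in unpack(raw))

# Rough timing of the bare parsing step (decoding excluded for struct).
for name, func in (('slices', lambda: parse_simple(line)),
                   ('struct', lambda: unpack(raw))):
    print(name, timeit.timeit(func, number=100000))
```

Note that the struct timing above leaves out the encode/decode steps, which a fair comparison on str input would have to include.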