python - How to efficiently parse fixed width files?


I am trying to find an efficient way of parsing files that hold fixed-width lines. For example, the first 20 characters represent a column, characters 21:30 another one, and so on.

Assuming that the line holds 100 characters, what would be an efficient way to parse a line into several components?

I could use string slicing per line, but it's a little bit ugly if the line is big. Are there any other fast methods?

Using the Python standard library's struct module would be easy as well as extremely fast, since it's written in C.

Here's how it could be used to do what you want. It also allows columns of characters to be skipped by specifying negative values for the number of characters in the field.

    import struct

    fieldwidths = (2, -10, 24)  # negative widths represent ignored padding fields
    # build a struct format such as '2s 10x 24s': 's' keeps bytes, 'x' skips them
    fmtstring = ' '.join('{}{}'.format(abs(fw), 'x' if fw < 0 else 's')
                         for fw in fieldwidths)
    fieldstruct = struct.Struct(fmtstring)
    parse = fieldstruct.unpack_from
    print('fmtstring: {!r}, recsize: {} chars'.format(fmtstring, fieldstruct.size))

    line = 'abcdefghijklmnopqrstuvwxyz0123456789\n'
    fields = parse(line)  # as written, this expects a byte string (a Python 2 str)
    print('fields: {}'.format(fields))

Output:

    fmtstring: '2s 10x 24s', recsize: 36 chars
    fields: ('ab', 'mnopqrstuvwxyz0123456789')

The following modifications would adapt it to work in Python 2 or 3 (and handle Unicode input):

    import sys

    # fmtstring is built exactly as in the previous snippet
    fieldstruct = struct.Struct(fmtstring)
    if sys.version_info[0] < 3:
        parse = fieldstruct.unpack_from
    else:
        # converts unicode input to a byte string and the results back to unicode strings
        unpack = fieldstruct.unpack_from
        parse = lambda line: tuple(s.decode() for s in unpack(line.encode()))
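To tie this back to the original question about files, here is a minimal Python 3 sketch that applies the struct-based parser to every line of a file-like object. The helper name `parse_lines`, the sample data, and the choice to pad short lines with spaces are illustrative assumptions and not part of the answer above. Keep in mind that struct counts bytes rather than characters, so the sketch encodes as ASCII; text containing multi-byte characters would not line up with the declared widths.

```python
import io
import struct

fieldwidths = (2, -10, 24)  # negative widths represent ignored padding fields
fmtstring = ' '.join('{}{}'.format(abs(fw), 'x' if fw < 0 else 's')
                     for fw in fieldwidths)
fieldstruct = struct.Struct(fmtstring)

def parse_lines(lines, encoding='ascii'):
    """Yield a tuple of str fields for each fixed-width line (hypothetical helper)."""
    for line in lines:
        raw = line.rstrip('\n').encode(encoding)
        # unpack_from() raises struct.error if the buffer is shorter than the
        # record, so pad short lines with spaces (an assumption of this sketch).
        raw = raw.ljust(fieldstruct.size)
        yield tuple(s.decode(encoding) for s in fieldstruct.unpack_from(raw))

# io.StringIO stands in for a real file opened in text mode.
sample = io.StringIO('abcdefghijklmnopqrstuvwxyz0123456789\n'
                     'AB----------short\n')
records = list(parse_lines(sample))
for fields in records:
    print(fields)
```

The second sample line is deliberately shorter than the record size, so its last field comes back padded with trailing spaces; call `rstrip()` on the fields if that padding is unwanted.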

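Here's a way to do it with string slices, as you were considering but were concerned that it might get too ugly. The nice thing about it, besides not being all that ugly, is that it works unchanged in both Python 2 and 3, as well as being able to handle Unicode strings. I haven't benchmarked it, but suspect it might be competitive with the struct module version speedwise. It could be sped up slightly by removing the ability to have padding fields.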

    try:
        from itertools import izip_longest  # added in Py 2.6
    except ImportError:
        from itertools import zip_longest as izip_longest  # name change in Py 3.x

    try:
        from itertools import accumulate  # added in Py 3.2
    except ImportError:
        def accumulate(iterable):
            'Return running totals (simplified version).'
            iterable = iter(iterable)  # accept any iterable, not just iterators
            total = next(iterable)
            yield total
            for value in iterable:
                total += value
                yield total

    def make_parser(fieldwidths):
        # cumulative end offsets of every field, padding included
        cuts = tuple(cut for cut in accumulate(abs(fw) for fw in fieldwidths))
        pads = tuple(fw < 0 for fw in fieldwidths)  # bool values for padding fields
        flds = tuple(izip_longest(pads, (0,)+cuts, cuts))[:-1]  # ignore final one
        parse = lambda line: tuple(line[i:j] for pad, i, j in flds if not pad)
        # optional informational function attributes
        parse.size = sum(abs(fw) for fw in fieldwidths)
        parse.fmtstring = ' '.join('{}{}'.format(abs(fw), 'x' if fw < 0 else 's')
                                   for fw in fieldwidths)
        return parse

    line = 'abcdefghijklmnopqrstuvwxyz0123456789\n'
    fieldwidths = (2, -10, 24)  # negative widths represent ignored padding fields
    parse = make_parser(fieldwidths)
    fields = parse(line)
    print('format: {!r}, rec size: {} chars'.format(parse.fmtstring, parse.size))
    print('fields: {}'.format(fields))

Output:

    format: '2s 10x 24s', rec size: 36 chars
    fields: ('ab', 'mnopqrstuvwxyz0123456789')
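Since the speed comparison between the two approaches is left unmeasured, one rough way to check it on your own machine is with timeit. This is only a sketch for Python 3: the function names `parse_struct` and `parse_slices`, the precomputed `spans` tuple, and the call count are assumptions made for illustration, and the timings will vary with the interpreter version and the record layout.

```python
import struct
import timeit
from itertools import accumulate

fieldwidths = (2, -10, 24)  # negative widths represent ignored padding fields
line = 'abcdefghijklmnopqrstuvwxyz0123456789\n'

# struct-based parser (Python 3 form: encode the input, decode the results)
fmtstring = ' '.join('{}{}'.format(abs(fw), 'x' if fw < 0 else 's')
                     for fw in fieldwidths)
unpack = struct.Struct(fmtstring).unpack_from

def parse_struct(line):
    return tuple(s.decode() for s in unpack(line.encode()))

# slice-based parser, precomputing (start, stop) pairs for the kept fields
cuts = tuple(accumulate(abs(fw) for fw in fieldwidths))
spans = tuple((i, j) for fw, i, j in zip(fieldwidths, (0,) + cuts, cuts)
              if fw >= 0)

def parse_slices(line):
    return tuple(line[i:j] for i, j in spans)

# Both parsers should agree before their timings are compared.
assert parse_struct(line) == parse_slices(line)

for func in (parse_struct, parse_slices):
    seconds = timeit.timeit(lambda: func(line), number=100000)
    print('{}: {:.3f} s per 100,000 calls'.format(func.__name__, seconds))
```

Part of the struct version's time here goes to the encode and decode steps, so reading the file in binary mode and working with bytes throughout would likely shift the comparison.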
