Python - How to efficiently parse fixed width files?

April 15, 2011

I am trying to find an efficient way of parsing files that hold fixed-width lines. For example, the first 20 characters represent one column, characters 21:30 another one, and so on.

Assuming that a line holds 100 characters, what would be an efficient way to parse the line into several components?

I could use string slicing per line, but it's a little bit ugly if the line is big. Are there any other fast methods?

Using the Python standard library's struct module would be fairly easy as well as extremely fast, since it's written in C.

Here's how it could be used to do what you want. It also allows columns of characters to be skipped by specifying negative values for the number of characters in the field.

    import struct

    fieldwidths = (2, -10, 24)  # negative widths represent ignored padding fields
    fmtstring = ' '.join('{}{}'.format(abs(fw), 'x' if fw < 0 else 's')
                         for fw in fieldwidths)
    fieldstruct = struct.Struct(fmtstring)
    # Python 2: str is already a byte string, so unpack_from accepts it directly
    parse = fieldstruct.unpack_from
    print('fmtstring: {!r}, recsize: {} chars'.format(fmtstring, fieldstruct.size))

    line = 'abcdefghijklmnopqrstuvwxyz0123456789\n'
    fields = parse(line)
    print('fields: {}'.format(fields))

Output:

    fmtstring: '2s 10x 24s', recsize: 36 chars
    fields: ('ab', 'mnopqrstuvwxyz0123456789')

The following modifications would adapt it to work in Python 2 or 3 (and handle Unicode input):

    import sys

    fieldstruct = struct.Struct(fmtstring)
    if sys.version_info[0] < 3:
        parse = fieldstruct.unpack_from
    else:
        # converts unicode input to a byte string and the results back to unicode strings
        unpack = fieldstruct.unpack_from
        parse = lambda line: tuple(s.decode() for s in unpack(line.encode()))
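Put together, the Python 3 branch might be wrapped up and used on several records as in the sketch below. The helper name `make_struct_parser`, its `encoding` parameter, and the sample records are made up for illustration, and the sketch assumes single-byte (here ASCII) input, since struct counts bytes rather than characters.

```python
import struct

def make_struct_parser(fieldwidths, encoding='ascii'):
    """Build a parse(line) function from field widths (hypothetical helper).

    Negative widths mark padding columns that are skipped.
    """
    fmtstring = ' '.join('{}{}'.format(abs(fw), 'x' if fw < 0 else 's')
                         for fw in fieldwidths)
    unpack = struct.Struct(fmtstring).unpack_from

    def parse(line):
        # struct works on bytes: encode on the way in, decode on the way out
        return tuple(s.decode(encoding) for s in unpack(line.encode(encoding)))

    return parse

parse = make_struct_parser((2, -10, 24))
records = [
    'abcdefghijklmnopqrstuvwxyz0123456789\n',
    'AB----------second record, 24 chars.\n',
]
rows = [parse(line) for line in records]
for row in rows:
    print(row)
```

Note that `unpack_from` raises `struct.error` when a line is shorter than the record size, so short or truncated lines would need to be padded or filtered out first.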

Here's a way to do it with string slices, as you were considering but were concerned might get too ugly. The nice thing about it, besides not being all that ugly, is that it works unchanged in both Python 2 and 3, as well as being able to handle Unicode strings. I haven't benchmarked it, but suspect it might be competitive with the struct module version speedwise. It could be sped up slightly by removing the ability to have padding fields.

    try:
        from itertools import izip_longest  # added in Py 2.6
    except ImportError:
        from itertools import zip_longest as izip_longest  # name change in Py 3.x

    try:
        from itertools import accumulate  # added in Py 3.2
    except ImportError:
        def accumulate(iterable):
            'Return running totals (simplified version).'
            iterable = iter(iterable)  # also accept lists and tuples
            total = next(iterable)
            yield total
            for value in iterable:
                total += value
                yield total

    def make_parser(fieldwidths):
        cuts = tuple(cut for cut in accumulate(abs(fw) for fw in fieldwidths))
        pads = tuple(fw < 0 for fw in fieldwidths)  # bool values for padding fields
        flds = tuple(izip_longest(pads, (0,)+cuts, cuts))[:-1]  # ignore final one
        parse = lambda line: tuple(line[i:j] for pad, i, j in flds if not pad)
        # optional informational function attributes
        parse.size = sum(abs(fw) for fw in fieldwidths)
        parse.fmtstring = ' '.join('{}{}'.format(abs(fw), 'x' if fw < 0 else 's')
                                   for fw in fieldwidths)
        return parse

    line = 'abcdefghijklmnopqrstuvwxyz0123456789\n'
    fieldwidths = (2, -10, 24)  # negative widths represent ignored padding fields
    parse = make_parser(fieldwidths)
    fields = parse(line)
    print('format: {!r}, rec size: {} chars'.format(parse.fmtstring, parse.size))
    print('fields: {}'.format(fields))

Output:

    format: '2s 10x 24s', rec size: 36 chars
    fields: ('ab', 'mnopqrstuvwxyz0123456789')
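For the two open points above (dropping padding support for speed, and the missing benchmark), a rough sketch for Python 3.2+ could look like the following. The names `make_simple_parser`, `slice_parse` and `struct_parse` are hypothetical, and the timings depend entirely on the machine and Python version, so none are quoted here.

```python
import struct
import timeit
from itertools import accumulate  # Python 3.2+

def make_simple_parser(fieldwidths):
    """Slice-based parser for positive widths only (hypothetical helper)."""
    ends = tuple(accumulate(fieldwidths))   # running totals: end offset of each field
    spans = tuple(zip((0,) + ends, ends))   # (start, end) pairs, one per field
    return lambda line: tuple(line[i:j] for i, j in spans)

widths = (2, 10, 24)  # no negative (padding) widths in this simplified version
line = 'abcdefghijklmnopqrstuvwxyz0123456789\n'

slice_parse = make_simple_parser(widths)

# struct-based equivalent for comparison; assumes ASCII-only input
unpack = struct.Struct(' '.join('{}s'.format(w) for w in widths)).unpack_from
struct_parse = lambda s: tuple(f.decode() for f in unpack(s.encode()))

fields = slice_parse(line)
print('fields: {}'.format(fields))

# Time both approaches on the same line; results vary between machines.
slice_time = timeit.timeit(lambda: slice_parse(line), number=100000)
struct_time = timeit.timeit(lambda: struct_parse(line), number=100000)
print('slices: {:.3f}s, struct: {:.3f}s'.format(slice_time, struct_time))
```

Because every field is kept, the parse function no longer has to test a padding flag for each slice; unwanted columns can simply be ignored by the caller.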
