[Python-talk] Kent's Korner?

Ric Werme ewerme at comcast.net
Fri Oct 19 09:00:52 EDT 2007

> Bill Sconce wrote:
> > We do have certain ingredients for maybe solving a simple problem, and
> > using it as an entertaining program (i.e., Ted's screen scraper).
> > The proposal seems to be:
> >   1. Kent's Korner -- introduce the little library module, BeautifulSoup.
> >   2. Someone drives, putting code onto the wall as suggested by the group.
> I wonder if this might turn out better without me introducing BS? The 
> docs are pretty good and KK could easily turn into solving the problem 
> at hand which would take the fun out of the programming exercise.

Gee, I was going to suggest we all sit around and watch you write it.
Popcorn replaces cookies.

One thing I was thinking of using Beautiful Soup for is code I need
for a Periodicals Mail permit, to convert people's addresses to ZIP+4
codes.  See http://zip4.usps.com/zip4/welcome.jsp .  The result is a page
with a 200-line body (and 700-line head) that changes every so often, I
think in part to annoy bulk converters (I do a dozen or two a month).

Whenever they change it, I usually find out a few days before I print
labels, so it's a scramble to fish out the results from the new style.

I've generally used state machines to help hunt down the right data, but
BS might be easier and also ease hunting down error replies for
bogus addresses.  State machines are easy to hack-until-it-works-today,
but back in college Prof. Wulf showed that looping over the BLISS equivalent
of a C switch() could quite nicely give you most of the bad effects of
gotos.  Ever since I've thought less of state machines and better of gotos.

Current relevant code:

def parse(lines):
    addr = ''
    zip = ''
    zip4 = ''
    delpt = ''
    fo = open("zip4reply.html", "w")
    state = 0
    for l in lines:
        fo.write('%d: %s' % (state, l))
        if state == 0:
            if l.find('td headers="full"') >= 0:
                state = 1
        elif state == 1:	# ZIP is after spaces
            pos = l.find('  ')
            if pos >= 0:
                pos = l.find('-', pos)
                if pos > 0:
                    zip = l[pos-5 : pos]
                    zip4 = l[pos+1 : pos+5]
                    print 'zip is "%s", zip4 is "%s"' % (zip, zip4)
                    state = 2
        elif state == 2:	# Looking for delivery point
            if l.find('mailingIndustryPopup2') >= 0:
                state = 3
        elif state == 3:	# Skip line with county name
            state = 4
        elif state == 4:
            delpt = l[-6:-4]
            print 'delivery point is "%s"' % delpt
            state = 5
    fo.close()	# Flush the debug copy of the reply
    if state > 5:	# ### Can't happen
        print 'Non-unique match?  State should be 5, is %d.' % state
        zip4 = ''
        delpt = ''
    if state < 5:
        print 'No match?  State should be 5, is %d.' % state
    postnet = zip+zip4+delpt
    if len(postnet) != 11:
        print 'Not all fields are right length!'
    return (zip, zip4, delpt, postnet)
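
The length test at the end only catches some failures: eleven characters
of the wrong kind would pass. A stricter per-field check might look like
the sketch below. It is written for Python 3, the helper name check_fields
is made up for illustration, and the widths (5, 4, 2) are simply the ones
the code above relies on.

```python
import re


def check_fields(zip5, zip4, delpt):
    """Return a list of problems; an empty list means the fields look sane."""
    problems = []
    # Expected widths match the slices taken by parse(): 5 + 4 + 2 = 11.
    for name, value, width in (('zip', zip5, 5),
                               ('zip4', zip4, 4),
                               ('delpt', delpt, 2)):
        # Require exactly `width` ASCII digits, nothing else.
        if not re.fullmatch(r'[0-9]{%d}' % width, value):
            problems.append('%s should be %d digits, got "%s"'
                            % (name, width, value))
    return problems
```

Called with made-up values such as check_fields('03301', '1234', '83') it
returns an empty list; a stray character or short field gets one message each.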

Typical HTML to deal with is:
      <td headers="full" height="34" valign="top" class="main" 
style="background:url(images/table_gray.gif); padding:5px 10px;">

		83 N MAIN ST<br />

		<br />

      <td style="background:url(images/table_gray.gif);">&nbsp;</td>
      <td height="34" align="right" valign="top" class="main" 
style="background:url(images/table_gray.gif); padding:5px 10px;">
		<a title="Mailing Industry Information" href="#" 
			'Y');" >Mailing Industry Information</a>

Note I have to fish the "delivery point" (83) from a JavaScript
subroutine call.  That gets used in the USPS bar codes (which I
currently don't generate).

> What do you think?

I'd be glad to use Kent's code.  :-)  However, I have no idea whether BS would
be a better choice for this task.  Speed of execution is not a concern;
speed of adapting to new HTML is.
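
For comparison, a parser-based version need not be long. This is a minimal
sketch in Python 3 that uses the standard library's html.parser as a
stand-in, so it is not Beautiful Soup's API. It assumes the ZIP+4 still
sits inside the td with headers="full". The names FullCellText, find_zip4
and SAMPLE are made up, and the city line in SAMPLE is invented, since the
snippet above does not show one.

```python
import re
from html.parser import HTMLParser


class FullCellText(HTMLParser):
    """Collect the text inside the first <td headers="full"> cell."""

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.done = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != 'td' or self.done:
            return
        if self.in_cell:
            # A new <td> before any </td> also ends the cell.
            self.in_cell = False
            self.done = True
        elif dict(attrs).get('headers') == 'full':
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == 'td' and self.in_cell:
            self.in_cell = False
            self.done = True

    def handle_data(self, data):
        if self.in_cell:
            self.chunks.append(data)


def find_zip4(html):
    """Return (zip5, zip4), or None if the address cell has no ZIP+4."""
    parser = FullCellText()
    parser.feed(html)
    parser.close()
    m = re.search(r'\b(\d{5})-(\d{4})\b', ' '.join(parser.chunks))
    return m.groups() if m else None


# Made-up sample, modelled on the snippet above.
SAMPLE = '''<td headers="full" height="34" valign="top" class="main"
style="background:url(images/table_gray.gif); padding:5px 10px;">
    83 N MAIN ST<br />
    ANYTOWN NH&nbsp;&nbsp;03301-1234<br />
<td style="background:url(images/table_gray.gif);">&nbsp;</td>'''

print(find_zip4(SAMPLE))  # ('03301', '1234')
```

The only things tied to the page layout are the headers="full" test and the
regular expression, so a redesign should mean changing a line or two rather
than renumbering states.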

	-Ric Werme
