[Python-talk] PySIG this Thursday - Beautiful Soup - HTML scraping; actual code developed before your eyes

Bill Sconce sconce at in-spec-inc.com
Mon Oct 22 23:01:25 EDT 2007


PySIG                    Manchester, NH                  25 October 2007
------------------------------------------------------------------------
Kent Johnson: Beautiful Soup
Us Ourselves: Python in Action
------------------------------------------------------------------------

____________________________________________________________________
PySIG -- New Hampshire Python Special Interest Group
Amoskeag Business Incubator, Manchester, NH
25 October 2007 (4th Thursday)   7:00PM

The monthly meeting of PySIG, the NH Python Special Interest Group,
takes place on the fourth Thursday of the month, starting at 7:00 PM.
Beginners' session precedes at 6:30 PM.  (Bring a Python question!)

--------------------------------------------------------------------
Kent's Korner - Kent Johnson: Beautiful Soup
--------------------------------------------------------------------
"Beautiful Soup is a Python HTML/XML parser designed for quick
turnaround projects like screen-scraping. Three features make it
powerful:
    1. Beautiful Soup won't choke if you give it bad markup. It 
  yields a parse tree that makes approximately as much sense as
  your original document. This is usually good enough to collect
  the data you need and run away.
    2. Beautiful Soup provides a few simple methods and Pythonic
  idioms for navigating, searching, and modifying a parse tree: a
  toolkit for dissecting a document and extracting what you need.
  You don't have to create a custom parser for each application.
    3. Beautiful Soup automatically converts incoming documents to
  Unicode and outgoing documents to UTF-8. You don't have to think
  about encodings, unless the document doesn't specify an encoding
  and Beautiful Soup can't autodetect one. Then you just have to
  specify the original encoding.
  
"Beautiful Soup parses anything you give it, and does the tree
traversal stuff for you. You can tell it 'Find all the links', or 
'Find all the links of class externalLink', or 'Find all the links 
whose urls match "foo.com"', or 'Find the table heading that's got
bold text, then give me that text.'

"Valuable data that was once locked up in poorly-designed websites
is now within your reach. Projects that would have taken hours take
only minutes with Beautiful Soup."
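
As a rough sketch of what those requests can look like in code
(not Kent's material): the example below assumes the present-day
"bs4" package (Beautiful Soup 4), which is newer than the release
current in 2007, and uses Python's built-in "html.parser" backend.
The markup, the class name externalLink, and the URLs are made up
for illustration.

```python
import re

# Assumption: the third-party bs4 package (Beautiful Soup 4) is installed.
from bs4 import BeautifulSoup

# Invented, deliberately sloppy markup: several tags are never closed.
HTML = """
<table>
  <tr><th><b>Group</b></th><th>Attendance</th>
  <tr><td>PySIG<td>12
</table>
<p>See <a class="externalLink" href="http://foo.com/a">foo</a>
and <a href="http://example.org/b">example</a>
"""

soup = BeautifulSoup(HTML, "html.parser")

# 'Find all the links'
all_links = soup.find_all("a")

# 'Find all the links of class externalLink'
external = soup.find_all("a", class_="externalLink")

# 'Find all the links whose urls match "foo.com"'
foo_links = soup.find_all("a", href=re.compile(r"foo\.com"))

# 'Find the table heading that's got bold text, then give me that text.'
bold_heading = next(th for th in soup.find_all("th") if th.find("b"))
heading_text = bold_heading.get_text(strip=True)

print(len(all_links), [a["href"] for a in external], heading_text)
```

Each query is one call on the parsed tree; no custom parser is
written for the page, which is the point the quoted text makes.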

--------------------------------------------------------------------
1st-ever PySIG development sprint: we try to write Actual Code
--------------------------------------------------------------------
Per a Challenge from Ted Roche.  Viz,
    "Recently, I started messing with some of the data we have
    stored on GNHLUG.org, specifically, the Past Events page:
        http://wiki.gnhlug.org/twiki2/bin/view/Www/PastEvents
    "Just like an end user, I had some questions, simple to ask,
    tough to answer. We have attendance data for most meetings
    since September of 2005. Given those dates to present,
      1. What's the average monthly attendance at a GNHLUG event?
      2. What's the average attendance over that period per
           group/chapter/SIG?
      3. What's the most popular meeting? 
           Most popular for each group?
      4. What are the trends in attendance? Up, down?
    "It seems like an interesting real-world problem: scrape a
    web page of questionable HTML, interpret dirty data (not all
    groups are groups, not all attendance numbers are numbers),
    dump it into a database (or perhaps a spreadsheet?) and do 
    the calculations. Pretty graphs get extra points.
    "I think the results would be interesting, and the process
    of getting to the results interesting, too: presenting how you
    take on the problem, what tools you use, how much code is
    needed, would make a fun meeting not only inside the SIG, 
    but to LUG meetings as well."
        --posted to PySIG mailing list 26 Sept 07
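
One possible shape for the "interpret dirty data, then calculate"
half of the challenge, using only the standard library. This is a
sketch under stated assumptions, not the code the sprint will
produce: the rows, the attendance figures, and the helper names
parse_attendance and average_by_group are all hypothetical, and
scraping the rows off the page is assumed to have happened already.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (group, attendance-cell) pairs; all values are invented.
ROWS = [
    ("PySIG", "12"),
    ("PySIG", "8"),
    ("MerriLUG", "20"),
    ("MerriLUG", "n/a"),   # attendance that is not a number
    ("", "15"),            # row with no group name
    ("DLSLUG", " 10 "),    # stray whitespace around the number
]


def parse_attendance(cell):
    """Return the cell as an int, or None if it is not a plain number."""
    text = cell.strip()
    return int(text) if text.isdecimal() else None


def average_by_group(rows):
    """Average attendance per group, skipping rows that cannot be used."""
    counts = defaultdict(list)
    for group, cell in rows:
        name = group.strip()
        attendance = parse_attendance(cell)
        if name and attendance is not None:
            counts[name].append(attendance)
    return {name: mean(values) for name, values in counts.items()}


averages = average_by_group(ROWS)
most_popular = max(averages, key=averages.get)
print(averages, most_popular)
```

Unusable rows are dropped silently here; a real run would probably
want to log them, since how much data is dirty is itself one of the
interesting results.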

And the group thought so too... so here we go!  We'll throw code
up on the screen, develop the screenscraper Ted envisions (or as
much of it as we can get done in the available time) and publish
the results.  A real-world test of RAD, as empowered by Python
(its "batteries-included" standard library, plus third-party
add-ons such as Beautiful Soup).

All are welcome.  Come to help us code.  Or come to laugh  :)

Plus:
-------------------------------------------------------------------
o Kent's Korner   The Real Stuff!  - Kent Johnson
    This month: Beautiful Soup (see above)
    Upcoming Kent's Korner topics:
        XML parsing  (ElementTree)
        Profiling (timeit, prof)
        
o Our usual roundtable of introductions, happenings, announcements

o Gotcha contest
    - Got a favorite "gotcha"?  Bring it and share...

And of course, milk & cookies.
  Cookies are assured, thanks to Janet.  Milk also, thanks to Alex.
  
-------------------------------------------------------------------
6:30   Beginners' Q&A
7:00   Welcome, Announcements - Bill & Ted & Alex
7:10   Milk & Cookies - Alex & Janet
7:10   Favorite-gotcha contest
7:15   Kent's Korner (Python Module of the Month) - Beautiful Soup
7:45   Development Sprint!  Web-Scraping; the Ted Roche Challenge
9:00~  Adjourn

___________________________________________________________________
About PySIG:
    PySIG meetings are typically 10-20 people, around a large table
    equipped with a projector and Internet hookups (wired and
    wireless).  We encourage laptops and a hands-on seminar style.
    The main meeting starts at 7 PM; officially we finish circa 9 PM.  
    Everyone is welcome.  ("Membership" is anyone who has an interest
    in the Python programming language, whether on Microsoft systems
    or Linux or OS X; or cell phones, mainframes, or space stations.
    We have everyone from object-oriented gurus to recovering COBOL
    programmers.)  Tell your friends!
    
Beginners' session:
    The half hour before the formal meeting (i.e., starting at 6:30PM)
    we have a beginners' session.  Any Python question is welcome -- 
    whoever asks the first question gets the half hour!  Questions are
    equally welcome by mail beforehand (in which case we can announce
    them) or at the meeting.  (As are all Python questions, anytime.)

Mailing list:
    http://www.dlslug.org/mailman/listinfo/python-talk

About Python:
    "Python is a dynamic object-oriented programming language that
    can be used for many kinds of software development. It offers 
    strong support for integration with other languages and tools, 
    comes with extensive standard libraries, and can be learned
    in a few days.  Many Python programmers report substantial 
    productivity gains and feel the language encourages the 
    development of higher quality, more maintainable code."

    "NASA uses Python...so does Rackspace, Industrial Light&Magic,
    AstraZeneca, Honeywell, and many others."

    Google: "Python has been an important part of Google since the
    beginning, and remains so as the system grows and evolves." 
    -Peter Norvig
    
    http://www.python.org

About Amoskeag Business Incubator:
    Our gracious hosts are the Amoskeag Business Incubator, an
    organization providing a supportive entrepreneurial environment
    that stimulates the growth of businesses to ensure economic
    vitality and encourage job creation, by providing affordable
    office space and technical assistance to early stage companies.
    PySIG thanks the ABI for their generous hospitality.
    
    http://www.abi-nh.com

_______________________________________________________________________
Directions (thanks to Ted Roche for improvements to "from the north"):
    PySIG NH meetings are held at the Amoskeag Business Incubator,
    33 South Commercial Street, Manchester, NH.

    Coming in to Manchester using I-293, from the north:
        o Use Exit 6 from I-293.  Stay to the right on the ramp,
          yield twice to traffic incoming from the left, cross back
          over I-293 and accept one merge coming in from your right.
          
        o Then get in the right lane, and stay there, over the river,
          and onto the Canal Street exit ramp.
          
        o Take the first right off Canal Street onto North Commercial
          Street.  Enjoy the scenic mill buildings as the street turns
          into Commercial Street.
          
        o Coming to the traffic light get in the middle lane.  South
          Commercial Street starts on the other side of the light.
          You go straight through (and join the folks coming from the
          south at step * below).

    Coming in to Manchester using I-293, from the south:
        o Use the Granite Street exit.  Turn right (east).  Go under
          I-293 and cross the bridge over the Merrimack River.

        o Turn right (south) at the first light after crossing the
          bridge.

        * This is South Commercial Street.  Go past one parking-lot
          entrance, turn right into the second one.  33 South Commercial
          Street will be right in front of you.  You may go in via
          either the ramp or the door and three steps inside.

        o Inside.  Up the stairs if via the door.  Go through the
          glass doors - follow the diamonds on the floor.  Go left 
          at the last diamond.   (Under a sign which says 
              "<- Amoskeag Small Bus. Incubator").

        o More diamonds, another sign...  much glass and office
          space for SNHU; turn left there, 4 more diamonds and 
          you're at the glass doors for the Incubator.  An "abi"
          sign is above.

        o Through the doors, straight down the hall.  The ABI
          Conference Room is on the left.

________________________________________________________________________
$URL: svn://svn.in-spec-inc.com/isi/trunk/isi/opages/pysig.announcement $
$Id: pysig.announcement 1570 2007-10-23 02:44:57Z sconce $    $Rev: 1570 $

