XML Processing with Python

by Sean McGrath
December 06, 1999

As part of our XML'99 coverage, we are pleased to bring you this taster from the "Working with XML in Python" tutorial led by Sean McGrath.

Introduction

A century ago, when HTML and CGI ruled the waves, Perl dominated the Web programming scene. As the transition to XML on the Web gathers pace, competition for the hearts and minds of Web developers is heating up. One language attracting a lot of attention at the moment is Python.

In this article we will take a high-level look at Python. We will use the time-honored "Hello world" example program to illustrate the principal features of the language. We will then examine the XML processing capabilities of Python.

Python is free

Python is free. You will find downloadable source code plus pre-compiled executables on python.org. As you know, "free" is one of those words that is often heavily loaded on the Internet. Fear not. Python is free with a capital "F". You are free to do essentially anything you like with Python, including make commercial use of it or derivatives created from it.

Python is interpreted

Python is an interpreted language. Programs can execute directly from the plain text files that house them. Typically Python files have a .py extension. There is no compilation phase as far as the programmer is concerned. Just edit and run!

Python is portable

Python is portable. It runs on basically every computing platform of note, from mainframes to Palm Pilots and everything in between. Python uses a virtual machine architecture, similar in concept to Java's virtual machine. The Python interpreter "compiles" programs to virtual machine code on-the-fly. These compiled files (typically having a .pyc extension) are also portable across platforms running the same version of Python. That is to say, if you wish to keep your source files hidden from your end-users, you can simply ship the compiled .pyc files.
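The compilation step normally happens behind the scenes, but it can also be requested explicitly. The following is a minimal sketch using the py_compile module from a current Python 3 standard library; the temporary directory and the file name Hello.py are assumptions made only so the example runs on its own.

```python
import os
import py_compile
import tempfile

# Scaffolding: write a one-line program into a temporary directory.
work_dir = tempfile.mkdtemp()
source_path = os.path.join(work_dir, "Hello.py")
with open(source_path, "w") as f:
    f.write('print("Hello world")\n')

# Compile the source to virtual machine code and store it as Hello.pyc.
compiled_path = py_compile.compile(
    source_path, cfile=os.path.join(work_dir, "Hello.pyc")
)
print(compiled_path)
```

The resulting Hello.pyc can be passed to the interpreter in place of Hello.py, provided the interpreter is the same Python version that produced it.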

Python is easy to understand

Python is very easy to understand. Here is a Python program that prints the string "Hello world":

print "Hello world"

I think you will agree that programming a "Hello world" application cannot get much simpler than that! To execute this program, you put it in a text file, say Hello.py, and feed it to the Python interpreter like this:

python Hello.py

The output is, surprise, surprise:

Hello world

Note the complete lack of syntactic baggage in the Hello.py program. There are no mandatory keywords or semi-colons required to get this simple job done. This spartan, no-nonsense approach to syntax is one of the hallmarks of Python and applies equally well to large Python programs.

Python is interactive

By invoking the Python interpreter (typically by typing python on a UNIX/Linux system, or running the "IDLE" application on Windows), you will find yourself in an environment where you can execute Python statements interactively. As an example, here is the "Hello world" application again:

>>> print "Hello world"

This will output:

Hello world

Note that the ">>>" above is Python's command prompt. The interactive mode is an excellent environment for playing around with Python. It is also indispensable as a fully programmable calculator!

Python is WYSIWYG

Python is sometimes referred to as a WYSIWYG programming language. This is because the indentation of Python code controls how the code is executed. Python does not have begin/end keywords or braces for grouping code statements. It simply does not need them. Take a look at the following Python fragment:
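As a minimal sketch (the function name describe and its arguments are hypothetical, chosen only for illustration), consider a nested if with a single else:

```python
def describe(x, y):
    # The outer "if" and the "else" share one indentation level,
    # so the "else" belongs to the outer "if", not the inner one.
    if x > 0:
        if y > 0:
            return "both positive"
    else:
        return "x is not positive"
    # Reached only when x > 0 but y <= 0.
    return "x positive, y not"


print(describe(1, 1))
print(describe(-1, 5))
print(describe(1, -1))
```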



The indentation of the code is used to control how statements are grouped for execution purposes. There can be no ambiguity as to which if clause is associated with the else clause in the above code because both statements have the same level of indentation.

Functions in Python

We can turn the "Hello world" program into a Python function like this:
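A minimal sketch, written in Python 3 syntax, where print is a function rather than the print statement used in the earlier listings:

```python
def Hello():
    # The function body is indented beneath the def line.
    print("Hello world")


Hello()
```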



Note that statements within the body of a function are indented beneath the def Hello() line which introduces the function. The parentheses are a placeholder for function parameters. Here is a function that prints its parameters x and y as well as the string "Hello world":
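A minimal sketch in Python 3 syntax; the order in which the values are printed is an assumption:

```python
def Hello(x, y):
    # Print the greeting followed by both parameters on one line.
    print("Hello world", x, y)


Hello(1, 2)
```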



Python modules

A Python program typically consists of a number of modules. Any Python source file can serve as a module and be imported into another Python program. For example, assuming the Hello function above is housed in the file Greeting.py, we can import the function into a Python program and call it as follows:
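A minimal sketch in Python 3 syntax. So that it runs on its own, it first writes a small Greeting.py into a temporary directory; that scaffolding, and the contents of the file, are assumptions made for illustration. Only the last two statements are the import and call the text describes.

```python
import os
import sys
import tempfile

# Scaffolding: create Greeting.py in a temporary directory and make
# that directory visible to the import machinery.
module_dir = tempfile.mkdtemp()
with open(os.path.join(module_dir, "Greeting.py"), "w") as f:
    f.write('def Hello():\n    print("Hello world")\n')
sys.path.insert(0, module_dir)

# Import the module, then call the function it defines.
import Greeting

Greeting.Hello()
```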



Programs as modules to larger programs

Python makes it easy to write programs that can be used both as stand-alone programs and as modules to other programs.

Here is a modified version of Greeting.py which will print "Hello world" but can also still be imported into other programs:
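A minimal sketch of such a file, in Python 3 syntax:

```python
def Hello():
    print("Hello world")


# Run Hello() only when this file is executed directly,
# not when it is imported by another program.
if __name__ == "__main__":
    Hello()
```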



Note the special __name__ variable above. This variable is automatically set to "__main__" when a program is being executed directly. If it is being imported into another program, __name__ is set to the name of the module, which in this case would be "Greeting".

Python is object-oriented

Python is a very object-oriented language. Here is an extended version of the "Hello world" program, called Message.py, that can print any message via MessageHolder objects:
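A minimal sketch in Python 3 syntax. The names MessageHolder and getMsg come from the text; the constructor and the msg attribute are assumptions about how the message is stored.

```python
class MessageHolder:
    def __init__(self, msg):
        # Remember the message handed to the constructor.
        self.msg = msg

    def getMsg(self):
        # Return the stored message unchanged.
        return self.msg


holder = MessageHolder("Hello world")
print(holder.getMsg())
```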



Note how indentation is used to structure the source code. The getMsg function is associated with objects of the MessageHolder class because it is indented beneath the class MessageHolder line. Functions associated with objects are more generally known as methods.

Suppose now that I need a variation on the MessageHolder class in which all messages are returned in uppercase. I can do that by subclassing MessageHolder, specifying the class I wish to inherit from in parentheses after the class name:
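A minimal sketch in Python 3 syntax. The subclass name UpperCaseMessageHolder is hypothetical, and the base class is repeated (with an assumed constructor) so that the example is self-contained.

```python
class MessageHolder:
    def __init__(self, msg):
        self.msg = msg

    def getMsg(self):
        return self.msg


class UpperCaseMessageHolder(MessageHolder):
    def getMsg(self):
        # Reuse the inherited behaviour, then convert to uppercase.
        return MessageHolder.getMsg(self).upper()


holder = UpperCaseMessageHolder("Hello world")
print(holder.getMsg())
```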



Python is extensible

The Python language consists of a small core and a large collection of modules. Some of these modules are written in Python and some are written in C. As a user of Python modules, you cannot tell the difference. For example:



The first statement imports Lars Marius Garshol's implementation of an XML parser that is written purely in Python. The second statement imports the Python wrapping of James Clark's expat XML parser which is written in C.

Python programs using these modules cannot tell what language they have been implemented in. As you would expect, programs based on expat are typically faster owing to the speed advantages of a pure C implementation of an XML parser.
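The same point can be tried with modules from a current Python 3 standard library. These are stand-ins, not the two parsers named above: xml.sax.handler is written in Python, and pyexpat is the C wrapping of expat. The sample document is hypothetical.

```python
import pyexpat            # C extension wrapping the expat parser
import xml.sax.handler    # module written in Python

# Both are ordinary module objects; only their location on disk
# (if any) hints at the implementation language.
for module in (xml.sax.handler, pyexpat):
    location = getattr(module, "__file__", None) or "(compiled into the interpreter)"
    print(module.__name__, "->", location)

# Using the C-backed parser looks like using any other Python object.
element_names = []
parser = pyexpat.ParserCreate()
parser.StartElementHandler = lambda name, attrs: element_names.append(name)
parser.Parse(b"<greeting>Hello world</greeting>", True)
print(element_names)
```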

It is remarkably easy to write a Python module in C. This facility is very useful for speed-critical parts of large Python systems. It is also easy to "wrap" existing C libraries as Python modules, as has been done with expat. Many technologies exposing a C API have been wrapped as Python modules, for example Oracle, the Win32 API, and the wxWindows GUI toolkit, to name a few.

XML programming support

The core Python distribution (currently at version 1.5.2) has a simple non-validating XML parser module called xmllib. The vast bulk of Python's XML support is in the form of an add-on module under active development by the SIG for XML Processing in Python (known as XML-SIG). To illustrate Python's XML support, we will switch to an XML 1.0 version of the "Hello world" program processing the following file:



SAX

SAX is the Simple API for XML, spearheaded by David Megginson and developed as a collaborative effort on the XML-dev mailing list. The Python implementation was developed by Lars Marius Garshol.

A Python SAX application to count the words in Greeting.xml looks like this:
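A minimal sketch using the xml.sax package from a current Python 3 standard library rather than the 1999 XML-SIG modules. The document is a hypothetical stand-in for Greeting.xml, held in a string, and the handler name WordCounter is likewise an assumption.

```python
import xml.sax

# Hypothetical stand-in for the contents of Greeting.xml.
GREETING_XML = b"<Greeting><Message>Hello world</Message></Greeting>"


class WordCounter(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        # The parser may deliver text in several pieces, so collect
        # them and split into words only at the end.
        self.chunks.append(content)

    def word_count(self):
        return len("".join(self.chunks).split())


counter = WordCounter()
xml.sax.parseString(GREETING_XML, counter)
print(counter.word_count())
```

For this stand-in document the program prints 2.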



DOM

The DOM is a W3C initiative to standardize an API to XML (and HTML) documents. Python has two DOM implementations. The one in the XML-SIG modules is the work of Andrew Kuchling and Stéfane Fermigier. The other is called 4DOM and is the work of Fourthought, who have also created XSLT and XPath implementations in Python.

Here is a sample DOM application to count the words in Greeting.xml:
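A minimal sketch using xml.dom.minidom from a current Python 3 standard library, not either of the two DOM implementations named above. The document is a hypothetical stand-in for Greeting.xml, and the function name count_words is an assumption.

```python
from xml.dom import minidom

# Hypothetical stand-in for the contents of Greeting.xml.
GREETING_XML = "<Greeting><Message>Hello world</Message></Greeting>"


def count_words(node):
    # Text nodes contribute their words; every other node contributes
    # the total of its children.
    if node.nodeType == node.TEXT_NODE:
        return len(node.data.split())
    return sum(count_words(child) for child in node.childNodes)


document = minidom.parseString(GREETING_XML)
print(count_words(document))
```

Unlike the SAX version, the whole document is loaded into a tree first and then walked, which is simpler to reason about but uses more memory on large files.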



Native Python APIs

As well as industry-standard APIs, there is a native Python XML processing library known as Pyxie.

Pyxie is an open source XML processing library for Python which will be made publicly available in January 2000. Pyxie tries to make the best of Python's features to simplify XML processing.

Here is the word counting application developed using Pyxie:



In conclusion

We have looked at some of the main features of Python in a high-level way. We have also glimpsed some of the XML processing facilities available. For further information on programming with Python, I suggest you start with http://www.python.org.

Future articles on XML.com will take a closer look at implementing XML applications in Python.
