html2xhtml - HTTP invocation API (Beta)

Introduction

This document is intended for developers who want to invoke the html2xhtml service from their client programs over HTTP. It contains the technical details they need, along with code snippets in a few programming languages.

The first versions of html2xhtml operated only in command-line mode. Later, the capability to process documents submitted from HTML forms was added. In this mode, the HTML document to be converted and the configuration parameters are received through HTTP with the POST method and the multipart/form-data encoding, as defined by RFC 2388. Although well suited to invoking the service from Web browsers, this mechanism has two main drawbacks for developers who want to invoke it from their own programs:

The service generates an HTML page in which the resulting XHTML is embedded in escaped format (i.e., all the "<" and "&" characters are escaped with the appropriate entity references). This kind of output is convenient when the service is invoked from a browser, but it is awkward for client programs, which have to extract and unescape the result.
It is simpler to send the input HTML directly in the body of the HTTP request than to encode it with multipart/form-data.

Html2xhtml now provides a solution to both problems. First, it accepts a parameter that controls whether the output XHTML document is sent directly or embedded inside an HTML page. Second, it accepts POST requests in which the input HTML document is provided as the body of the message (without any kind of encoding).

The rest of this document details how to invoke the service.

The Web API of html2xhtml was released only recently and has not yet been properly tested in a production environment. Consider it a beta service. I would be very grateful for your feedback, which will help improve the service.

Code snippets

Before explaining the technical details, let's first look at some code that calls the service.

Invoking the service from shell scripts with curl

The html2xhtml service can be invoked from the shell and from shell scripts using a command-line HTTP client such as curl, which is free software and available for all the major operating systems.

The following piece of code invokes the html2xhtml service to convert the local file foo.html:

curl --data-binary @foo.html -H "Content-Type: text/html"  http://www.it.uc3m.es/jaf/cgi-bin/html2xhtml.cgi

The output XHTML document will be written to standard output. Configuration parameters can be appended to the URL of the service after a question mark character (remember to wrap the URL in single quotation marks, because the ampersand and the question mark are special characters in the shell):

curl --data-binary @foo.html -H "Content-Type: text/html"  \
     'http://www.it.uc3m.es/jaf/cgi-bin/html2xhtml.cgi?linelength=120&output-charset=utf-8'

For convenience, the call to curl can be wrapped in a shell script. The following is a simple example:

#!/bin/sh

# Convert the HTML file given as the only argument; the XHTML goes to standard output.
if [ "$#" -ne 1 ]
then
    echo "Input HTML file name is expected as a command-line parameter" >&2
    exit 1
fi

# Quote the argument so that file names containing spaces work
curl --data-binary "@$1" -H "Content-Type: text/html"  http://www.it.uc3m.es/jaf/cgi-bin/html2xhtml.cgi

Note that the curl project also includes a library, libcurl, that can be invoked from C, with bindings available for many other programming languages.

Invoking the service from Python

The following simple piece of Python code calls the html2xhtml service:

import sys
import http.client
import urllib.parse

headers = {"Content-Type": "text/html",
           "Accept": "application/xhtml+xml"}
params = urllib.parse.urlencode({'tablength': 4,
                                 'linelength': 100,
                                 'output-charset': 'UTF-8'})
url = "/jaf/cgi-bin/html2xhtml.cgi?" + params

# read the input HTML file (first command-line argument) as raw bytes
with open(sys.argv[1], 'rb') as in_file:
    in_data = in_file.read()

# connect to the server and send the POST request
conn = http.client.HTTPConnection("www.it.uc3m.es", 80)
conn.request("POST", url, in_data, headers)
response = conn.getresponse()

# show the result (written as bytes, in the charset chosen by the service)
if response.status == 200:
    sys.stdout.buffer.write(response.read())
else:
    print(response.status, response.reason, file=sys.stderr)

conn.close()

Other programming languages

Of course, the service can be invoked from many other programming languages through convenient HTTP libraries. If you write a client and want to share your code with others, feel free to send the code snippet to me and I'll publish it here.

Configuration parameters

The behavior of the service can be controlled by a number of parameters that can be sent along with the HTTP request. All of them are encoded as name-value pairs. None of them are mandatory: a default value is assumed for each if not set in the request.

Parameter name	Accepted values	Default value	Description
`type`	`auto`, `transitional`, `strict`, `frameset`, `1.1`, `basic-1.0`, `basic-1.1`, `print-1.0`, `mp`	`auto`	Controls the document type for the XHTML output. The `auto` mode tries to auto-detect the most suitable document type (for example, when the input has a document type declaration). If none is detected, it defaults to XHTML 1.0 Transitional.
`output`	`plain`, `html`	`plain` for direct input, `html` for multipart requests	If `plain` is selected, the output is directly the XHTML document resulting from the conversion. If `html` is selected instead, the XHTML document is embedded inside an HTML page.
`dos-eol`	0, 1	0	If `1`, the output contains CRLF (DOS-like) end of line characters. Otherwise, it contains UNIX-like end of line characters.
`tablength`	Integer between 0 and 16	2	Indentation length (0 for no indentation).
`linelength`	Integer >= 40	80	Maximum length of lines. Html2xhtml will try to wrap the output to the specified number of characters. Note that this is not always possible, for example inside space-preserving elements (`script`, `style` and `pre`) or when there are no blanks where a line can be broken.
`input-charset`	A character set alias		The input HTML document is parsed assuming the specified character set. If no character set is specified, html2xhtml tries to guess it with a series of auto-detection mechanisms, including declarations inside the HTML input. Since the auto-detection mechanisms seem to work fine for the usual character sets, this parameter is not normally necessary. Character sets have to be named using an alias defined in the official IANA list or in the list of aliases that the GNU libiconv library defines. The character set must be supported by the GNU iconv library.
`output-charset`	A character set alias	The same as the input	The output XHTML document is encoded with the specified character set. If no character set is specified, html2xhtml encodes the output with the same character set as the input.
`no-protect-cdata`	0, 1	0	The default behavior of html2xhtml is to enclose CDATA sections using "//<![CDATA[" and "//]]>", so that major browsers handle them properly even when processing the document in HTML (tag-soup) mode. Setting this option to 1 makes the program enclose CDATA sections in "script" and "style" following the XHTML 1.0 specification (using "<![CDATA[" and "]]>"). This might be incompatible with some browsers when they don't process the input in XML mode.
`preserve-space-comments`	0, 1	0	Set this option to 1 to preserve white spaces, tabs and ends of lines in HTML comments. The default behavior is to rearrange spacing.
`empty-elm-tags-always`	0, 1	0	By default, empty element tags are written only for elements declared as empty in the DTD. Setting this option to 1 makes any element without content be written with the empty element tag, even if it is not declared as empty in the DTD. This option may cause problems when the XHTML document is opened by browsers in HTML (tag-soup) mode.
`compact-empty-elem-tags`	0, 1	0	If set to 1, do not write a whitespace before the slash for empty element tags (i.e. write "<br/>" instead of the default "<br />"). Note that although both notations are well-formed in XML, the XHTML 1.0 standard recommends the latter to improve compatibility with legacy browsers.
`compact-block-elements`	0, 1	0	When the option is set to 1, no white spaces or line breaks are written between the start tag of a block element and the start tag of its first enclosed inline element (or character data), or between the end tag of its last enclosed inline element (or character data) and the end tag of the block element. By default, a new line character and indentation are written between them.

When the input is encoded with multipart/form-data, parameters must also be encoded as part of the multipart/form-data stream.

However, when the HTML input is sent directly, parameters must be encoded as specified in the HTML 4 specification for the application/x-www-form-urlencoded content type. The data stream must be appended to the path in the request line of the HTTP request. Examples are provided below.
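
As a rough sketch of the encoding just described, the following Python 3 snippet appends url-encoded parameters to the service path with the standard urllib.parse.urlencode function. The helper name build_service_url is hypothetical, and the two range checks are only an illustration taken from the parameter table above; the service itself may validate values differently.

```python
from urllib.parse import urlencode

SERVICE_URL = "http://www.it.uc3m.es/jaf/cgi-bin/html2xhtml.cgi"


def build_service_url(params=None):
    """Append url-encoded configuration parameters to the service URL.

    Hypothetical helper: it only checks the two numeric ranges listed in
    the parameter table and passes every other value through untouched.
    """
    params = dict(params or {})
    if "tablength" in params and not 0 <= int(params["tablength"]) <= 16:
        raise ValueError("tablength must be between 0 and 16")
    if "linelength" in params and int(params["linelength"]) < 40:
        raise ValueError("linelength must be at least 40")
    if not params:
        return SERVICE_URL
    # urlencode produces name=value pairs joined with "&"
    return SERVICE_URL + "?" + urlencode(params)


print(build_service_url({"type": "transitional", "tablength": 2}))
# http://www.it.uc3m.es/jaf/cgi-bin/html2xhtml.cgi?type=transitional&tablength=2
```

Keeping the URL construction in one place avoids hand-written query strings, where an unescaped character in a value could corrupt the request line.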

The HTTP request

The HTTP request sent to the html2xhtml service must conform to the following specification:

HTTP protocol version: 1.0 or 1.1
Method: POST
Host: www.it.uc3m.es
Path: /jaf/cgi-bin/html2xhtml.cgi
Content-Type header: text/html or application/xhtml+xml for direct input; multipart/form-data plus boundary for multipart requests.
Content-Length header: optional. If provided, it must be the length of the body in bytes.
Body: the input HTML document as is for direct input, or the multipart/form-data stream for multipart requests. In the latter case, the parameter name for the HTML part must be html, and it has to be the last part of the stream (i.e., all the configuration parameters must come before it in the multipart stream).
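
For the multipart variant, the ordering constraint can be sketched in Python 3 as follows. This is only an illustration under stated assumptions: the function name build_multipart_body is hypothetical, and the filename attribute and the per-part Content-Type header mimic what a browser sends for a file upload; the source does not say whether the service requires them.

```python
import uuid


def build_multipart_body(html_bytes, params, boundary=None):
    """Return (content_type, body) for a multipart/form-data request.

    Configuration parameters are written first and the "html" part last,
    as the request specification requires. Illustrative sketch only.
    """
    boundary = boundary or uuid.uuid4().hex
    lines = []
    for name, value in params.items():
        lines.append(f"--{boundary}".encode("ascii"))
        lines.append(
            f'Content-Disposition: form-data; name="{name}"'.encode("ascii"))
        lines.append(b"")
        lines.append(str(value).encode("ascii"))
    # The HTML document must be the last part of the stream.
    lines.append(f"--{boundary}".encode("ascii"))
    # Assumption: filename and Content-Type imitate a browser file upload.
    lines.append(
        b'Content-Disposition: form-data; name="html"; filename="input.html"')
    lines.append(b"Content-Type: text/html")
    lines.append(b"")
    lines.append(html_bytes)
    lines.append(f"--{boundary}--".encode("ascii"))
    lines.append(b"")
    body = b"\r\n".join(lines)
    return f"multipart/form-data; boundary={boundary}", body


content_type, body = build_multipart_body(
    b"<html><body><p>Hello</p></body></html>",
    {"type": "strict", "output": "plain"})
print(content_type)
```

The returned content_type string goes in the Content-Type header of the request and body is sent as the request body. The boundary must not occur inside the HTML document, which a random value makes very unlikely.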

The following example shows the format of a direct request:

POST /jaf/cgi-bin/html2xhtml.cgi?type=transitional&tablength=2 HTTP/1.1
Host: www.it.uc3m.es
Content-Type: text/html
Content-Length: 111

<html>
  <head>
    <title>Hello World!</title>
  </head>
  <body>
    <h1>Hello World!</h1>
  </body>
</html>
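
The Content-Length value in this example can be checked with a few lines of Python 3. The sketch assumes the body contains only single-byte characters, uses UNIX line endings, and ends with a newline after the closing html tag; with CRLF line endings or no trailing newline the count would differ.

```python
body = (
    "<html>\n"
    "  <head>\n"
    "    <title>Hello World!</title>\n"
    "  </head>\n"
    "  <body>\n"
    "    <h1>Hello World!</h1>\n"
    "  </body>\n"
    "</html>\n"
).encode("ascii")

# Content-Length counts the bytes of the body, not its characters.
content_length = len(body)
print(content_length)  # 111
```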

The HTTP response

The response from the html2xhtml service will have a 200 OK status if the conversion succeeds, or the appropriate HTTP error status otherwise. If the output was set to plain mode (the default for direct requests), the body of the response will carry the XHTML result as is. Note, however, that for HTTP/1.1 requests the server may apply the chunked transfer coding to the body of the response. The HTTP/1.1 library of your choice will probably handle this and provide your application with the decoded body. This is an example response:

HTTP/1.1 200 OK
Date: Sat, 09 Jan 2010 02:18:59 GMT
Server: Apache/1.3.34 (Debian)
Content-Type: application/xhtml+xml; charset=iso-8859-1

<?xml version="1.0" encoding="iso-8859-1"?>

<!DOCTYPE html
   PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
   "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" >

<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>
      Hello World!
    </title>
  </head>
  <body>
    <h1>
      Hello World!
    </h1>
  </body>
</html>
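
A client that needs the result as text, not bytes, can decode the body with the character set announced in the Content-Type header of the response. The following Python 3 sketch uses the standard email.message.Message class to parse the header. The function name decode_body and the utf-8 fallback are assumptions made for illustration.

```python
from email.message import Message


def decode_body(content_type, raw_body, default_charset="utf-8"):
    """Decode response bytes using the charset in the Content-Type header.

    default_charset is an assumption, used only when the header carries
    no charset parameter.
    """
    header = Message()
    header["Content-Type"] = content_type
    charset = header.get_content_charset() or default_charset
    return raw_body.decode(charset)


raw = '<?xml version="1.0" encoding="iso-8859-1"?>\n<p>caf\xe9</p>\n'.encode(
    "iso-8859-1")
text = decode_body("application/xhtml+xml; charset=iso-8859-1", raw)
print(text)
```

With http.client, the headers attribute of the response object offers the same get_content_charset() method, so the header does not need to be parsed by hand.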