Introduction
This document is intended for developers who want to invoke the html2xhtml service from their client programs over HTTP. It contains the technical details they need, along with code snippets for doing so from a few programming languages.
The first versions of html2xhtml operated only in command-line mode. Later, the capability to process documents sent from HTML forms was added. In this mode, the HTML document to be converted and the configuration parameters are received through HTTP with the POST method and the multipart/form-data encoding, as defined by RFC 2388. Although well suited to invoking the service from Web browsers, this mechanism has two main drawbacks for developers who want to invoke it from their own programs:
- The service generates an HTML page in which the resulting XHTML is embedded in escaped format (i.e., all the "<" and "&" characters are escaped with the appropriate entity references). This kind of output is convenient when the service is invoked from a browser, but is not very handy for developers.
- Sending the input HTML directly in the body of the HTTP request is simpler than encoding it with multipart/form-data.
Html2xhtml now provides a solution to both problems. First, it accepts a parameter that controls whether the output XHTML document is sent directly or embedded inside an HTML page. Second, it accepts POST requests in which the input HTML document is provided as the body of the message (without any kind of encoding).
The rest of this document details how to invoke the service.
The Web API of html2xhtml has been released quite recently. Hence it has not been properly tested yet in a production environment. Consider it a beta service. I would be very grateful to get your feedback in order to improve the service.
Code snippets
Let's see first some pieces of code to call the service before explaining the technical details.
Invoking the service from shell scripts with curl
The html2xhtml service can be invoked from the shell and from shell scripts using a command-line HTTP client such as curl. curl is free software and is available for all the major operating systems.
The following command invokes the html2xhtml service to convert the local file foo.html:

    curl --data-binary @foo.html -H "Content-Type: text/html" http://www.it.uc3m.es/jaf/cgi-bin/html2xhtml.cgi
The output XHTML document will be written to standard output. Configuration parameters can be appended to the URL of the service after a question mark character (remember to wrap the URL in single quotation marks, because the ampersand and the question mark are special characters in Bash):
    curl --data-binary @foo.html -H "Content-Type: text/html" \
         'http://www.it.uc3m.es/jaf/cgi-bin/html2xhtml.cgi?linelength=120&output-charset=utf-8'
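For convenience, the call to curl can be embedded in a shell script. The following is a simple example (it uses only POSIX shell features, so it also runs under Bash):

    #!/bin/sh
    # Convert the HTML file given as the only argument; the XHTML goes to stdout.
    if [ $# -ne 1 ]
    then
        echo "Input HTML file name is expected as a command-line parameter" >&2
        exit 1
    else
        # Quote the argument so that file names with spaces work
        curl --data-binary "@$1" -H "Content-Type: text/html" \
            http://www.it.uc3m.es/jaf/cgi-bin/html2xhtml.cgi
    fi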
Note that the curl project also includes a library, libcurl, that can be called from C, with bindings available for many other programming languages.
Invoking the service from Python
This is an example of a simple piece of Python code that calls the html2xhtml service. It is written for Python 2 (it uses httplib and the print statement):

    import sys
    import httplib, urllib

    headers = {"Content-type": "text/html",
               "Accept": "application/xhtml+xml"}
    params = urllib.urlencode({'tablength': 4,
                               'linelength': 100,
                               'output-charset': 'UTF-8'})
    url = "/jaf/cgi-bin/html2xhtml.cgi?" + params

    # read the input HTML file (first command-line argument) as raw bytes
    in_file = open(sys.argv[1], 'rb')
    in_data = in_file.read()
    in_file.close()

    # connect to the server and send the POST request
    conn = httplib.HTTPConnection("www.it.uc3m.es:80")
    conn.request("POST", url, in_data, headers)
    response = conn.getresponse()

    # show the result
    if response.status == 200:
        print response.read()
    else:
        print >> sys.stderr, response.status, response.reason

    conn.close()
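The snippet above relies on httplib and the print statement, so it runs only under Python 2. A rough Python 3 equivalent might look like the following. It is a sketch that has not been tested against the live service, and the helper names build_request and convert are made up for illustration. It uses only the standard urllib modules and sends the same headers and parameters as the original example.

```python
import urllib.parse
import urllib.request

SERVICE_URL = "http://www.it.uc3m.es/jaf/cgi-bin/html2xhtml.cgi"


def build_request(html_bytes, params=None):
    """Build a direct POST request whose body is the raw HTML document."""
    # Configuration parameters travel in the query string, not in the body.
    query = urllib.parse.urlencode(params or {})
    url = SERVICE_URL + ("?" + query if query else "")
    headers = {"Content-Type": "text/html",
               "Accept": "application/xhtml+xml"}
    return urllib.request.Request(url, data=html_bytes,
                                  headers=headers, method="POST")


def convert(html_bytes, params=None):
    """Send the request and return the XHTML response body as bytes.

    urlopen raises urllib.error.HTTPError when the status is not 2xx.
    """
    with urllib.request.urlopen(build_request(html_bytes, params)) as response:
        return response.read()
```

A caller would read the input file in binary mode, for example with open(path, "rb"), and pass the bytes to convert together with a dictionary such as {"tablength": 4, "linelength": 100, "output-charset": "UTF-8"}.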
Other programming languages
Of course, the service can be invoked from many other programming languages through convenient HTTP libraries. If you write a client and want to share your code with others, feel free to send the code snippet to me and I'll publish it here.
Configuration parameters
The behavior of the service can be controlled by a number of parameters that can be sent along with the HTTP request. All of them are encoded as name-value pairs. None of them are mandatory: a default value is assumed for each if not set in the request.
Parameter name | Accepted values | Default value | Description |
---|---|---|---|
type | auto, transitional, strict, frameset, 1.1, basic-1.0, basic-1.1, print-1.0, mp | auto | Controls the document type for the XHTML output. The auto mode tries to auto-detect the most suitable document type (for example, when the input has a document type declaration). If not detected, it defaults to XHTML 1.0 Transitional. |
output | plain, html | plain for direct input, html for multipart requests | The output is directly the XHTML document resulting from the conversion if plain is selected, or it is embedded inside an HTML page if html is selected instead. |
dos-eol | 0, 1 | 0 | If 1, the output contains CRLF (DOS-like) end-of-line characters. Otherwise, it contains UNIX-like end-of-line characters. |
tablength | Integer between 0 and 16 | 2 | Indentation length (0 for no indentation). |
linelength | Integer >= 40 | 80 | Maximum length of lines. Html2xhtml will try to wrap the output to the specified number of characters. Note that this is not always possible, for example inside space-preserving elements (script, style and pre) or when there are no blanks where a line can be broken. |
input-charset | A character set alias | None (auto-detected) | The input HTML document is parsed assuming the specified character set. If no character set is specified, html2xhtml tries to guess it with a series of auto-detection mechanisms, including declarations inside the HTML input. Since the auto-detection mechanisms seem to work fine for the usual character sets, this parameter is not normally necessary. Character sets have to be named using an alias defined in the official IANA list or in the list of aliases that the GNU libiconv library defines. The character set must be supported by the GNU iconv library. |
output-charset | A character set alias | The same as the input | The output XHTML document is encoded with the specified character set. If no character set is specified, html2xhtml encodes the output with the same character set as the input. |
no-protect-cdata | 0, 1 | 0 | The default behavior of html2xhtml is to enclose CDATA sections using "//<![CDATA[" and "//]]>", to make major browsers handle them properly even when processing the document in HTML (tag-soup) mode. Setting this option to 1 makes the program enclose CDATA sections in "script" and "style" following the XHTML 1.0 specification (using "<![CDATA[" and "]]>"). It might be incompatible with some browsers when they don't process the input in XML mode. |
preserve-space-comments | 0, 1 | 0 | Set this option to 1 to preserve white spaces, tabulators and ends of lines in HTML comments. The default behavior is to re-arrange spacing. |
empty-elm-tags-always | 0, 1 | 0 | By default, empty element tags are written only for elements declared as empty in the DTD. Setting this option to 1 makes any element without content be written with the empty element tag, even if it is not declared as empty in the DTD. This option may cause problems when the XHTML document is opened by browsers in HTML (tag-soup) mode. |
compact-empty-elem-tags | 0, 1 | 0 | If set to 1, no whitespace is written before the slash in empty element tags (i.e. "<br/>" is written instead of the default "<br />"). Note that although both notations are well-formed in XML, the XHTML 1.0 standard recommends the latter to improve compatibility with legacy browsers. |
compact-block-elements | 0, 1 | 0 | When the option is set to 1, no white spaces or line breaks are written between the start tag of a block element and the start tag of its first enclosed inline element (or character data), or between the end tag of its last enclosed inline element (or character data) and the end tag of the block element. By default, a new line character and indentation are written between them. |
When the input is encoded with multipart/form-data, parameters must also be encoded as part of the multipart/form-data stream. However, when the HTML input is sent directly, parameters must be encoded as specified in the HTML 4 specification for the application/x-www-form-urlencoded content type. The resulting string must be appended to the path in the request line of the HTTP request, after a question mark character. Examples are provided below.
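For direct requests, the encoding step can be sketched in Python 3 with the standard urlencode function, which produces application/x-www-form-urlencoded strings. The helper request_target is hypothetical, and its range checks simply mirror the limits listed in the parameter table. How the service itself reacts to out-of-range values is not documented here.

```python
from urllib.parse import urlencode

SERVICE_PATH = "/jaf/cgi-bin/html2xhtml.cgi"


def request_target(params):
    """Return the service path followed by the urlencoded parameters."""
    # Client-side checks taken from the parameter table (illustrative only).
    if "tablength" in params and not 0 <= params["tablength"] <= 16:
        raise ValueError("tablength must be between 0 and 16")
    if "linelength" in params and params["linelength"] < 40:
        raise ValueError("linelength must be at least 40")
    query = urlencode(params)
    # Append the parameters after a question mark only when there are any.
    return SERVICE_PATH + "?" + query if query else SERVICE_PATH
```

The returned string is what goes into the request line after the POST method, so validating on the client side avoids a round trip for obviously invalid values.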
The HTTP request
The HTTP request sent to the html2xhtml service must conform to the following specification:
- HTTP protocol version: 1.0 or 1.1
- Method: POST
- Host: www.it.uc3m.es
- Path: /jaf/cgi-bin/html2xhtml.cgi
- Content-Type header: text/html or application/xhtml+xml for direct input; multipart/form-data plus boundary for multipart requests.
- Content-Length header: not mandatory. If provided, the length in bytes of the body.
- Body: the input HTML document "as-it-is" for direct input, or the multipart/form-data stream otherwise. In the latter case, the parameter name for the HTML part must be html, and it has to be the last part of the stream (i.e., all the configuration parameters must be encoded before it in the multipart stream).
The following example shows the format of a direct request:
    POST /jaf/cgi-bin/html2xhtml.cgi?type=transitional&tablength=2 HTTP/1.1
    Host: www.it.uc3m.es
    Content-Type: text/html
    Content-Length: 111

    <html>
      <head>
        <title>Hello World!</title>
      </head>
      <body>
        <h1>Hello World!</h1>
      </body>
    </html>
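For the multipart variant, the ordering rule stated above (configuration parameters first, the html part last) can be sketched in Python 3 as follows. The function build_multipart is hypothetical. The filename attribute and the per-part Content-Type header mimic what browsers send for file fields and are assumptions, since this document does not say whether the service requires them.

```python
import uuid


def build_multipart(html_bytes, params):
    """Return (content_type, body) for a multipart/form-data request."""
    boundary = uuid.uuid4().hex
    delimiter = b"--" + boundary.encode("ascii")
    lines = []
    # Configuration parameters are written before the document itself.
    for name, value in params.items():
        lines += [
            delimiter,
            ('Content-Disposition: form-data; name="%s"' % name).encode("ascii"),
            b"",
            str(value).encode("ascii"),
        ]
    # The part named "html" must be the last one in the stream.
    lines += [
        delimiter,
        # filename and Content-Type are assumptions, not documented requirements
        b'Content-Disposition: form-data; name="html"; filename="input.html"',
        b"Content-Type: text/html",
        b"",
        html_bytes,
        delimiter + b"--",
        b"",
    ]
    return "multipart/form-data; boundary=" + boundary, b"\r\n".join(lines)
```

The first element of the returned pair is the value for the Content-Type header of the request, and the second is the request body. Remember that multipart requests default to the html output mode, so output=plain has to be included in params to receive the bare XHTML document.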
The HTTP response
The response of the html2xhtml service will have a 200 OK status if the program succeeds, or the appropriate HTTP error status otherwise. If the output was set to plain mode (the default for direct requests), the body of the response will carry the XHTML result "as-it-is". Note, however, that for HTTP/1.1 requests the server may apply the chunked transfer coding to the body of the response, but the HTTP/1.1 compatible library of your choice will probably handle it properly and provide your application with the decoded body.
This is an example response:
    HTTP/1.1 200 OK
    Date: Sat, 09 Jan 2010 02:18:59 GMT
    Server: Apache/1.3.34 (Debian)
    Content-Type: application/xhtml+xml; charset=iso-8859-1

    <?xml version="1.0" encoding="iso-8859-1"?>
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
        "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" >
    <html xmlns="http://www.w3.org/1999/xhtml">
      <head>
        <title>
          Hello World!
        </title>
      </head>
      <body>
        <h1>
          Hello World!
        </h1>
      </body>
    </html>
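Because the response declares its character set in the Content-Type header (iso-8859-1 in the example above), a client that needs text rather than bytes can decode the body accordingly. The following Python 3 sketch uses the standard email.message.Message class to parse the header. The helper decode_body is hypothetical, and the UTF-8 fallback for a missing charset is an assumption rather than documented service behavior.

```python
from email.message import Message


def decode_body(content_type, body):
    """Decode a response body using the charset of its Content-Type header."""
    header = Message()
    header["Content-Type"] = content_type
    # Fall back to UTF-8 when no charset is declared (an assumption).
    charset = header.get_content_charset("utf-8")
    return body.decode(charset)
```

With an HTTP library that exposes response headers, the first argument would be the value of the Content-Type header and the second the raw bytes read from the response.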