html

The lxml.html tool set for HTML handling.

Package Contents

Classes

Classes(self,attributes) Provides access to an element’s class attribute as a set-like collection.
HtmlMixin()
_MethodFunc(self,name,copy=False,source_class=HtmlMixin) An object that represents a method on an element as a function;
HtmlComment()
HtmlElement()
HtmlProcessingInstruction()
HtmlEntity()
HtmlElementClassLookup(self,classes=None,mixins=None) A lookup scheme for HTML Element classes.
FormElement() Represents a <form> element.
FieldsDict(self,inputs)
InputGetter(self,form) An accessor that represents all the input fields in a form.
InputMixin() Mix-in for all input elements (input, select, and textarea)
TextareaElement() <textarea> element. You can get the name with .name and
SelectElement() <select> element. You can get the name with .name.
MultipleSelectOptions(self,select) Represents all the selected options in a <select multiple> element.
RadioGroup() This object represents several <input type=radio> elements
CheckboxGroup() Represents a group of checkboxes (<input type=checkbox>) that
CheckboxValues(self,group) Represents the values of the checked checkboxes in a group of
InputElement() Represents an <input> element.
LabelElement() Represents a <label> element.
HTMLParser(self,**kwargs) An HTML parser that is configured to return lxml.html Element
XHTMLParser(self,**kwargs) An XML parser that is configured to return lxml.html Element

Functions

__fix_docstring(s)
_unquote_match(s,pos)
_transform_result(typ,result) Convert the result back into the input type.
_nons(tag)
document_fromstring(html,parser=None,ensure_head_body=False,**kw)
fragments_fromstring(html,no_leading_text=False,base_url=None,parser=None,**kw) Parses several HTML elements, returning a list of elements.
fragment_fromstring(html,create_parent=False,base_url=None,parser=None,**kw) Parses a single HTML element; it is an error if there is more than
fromstring(html,base_url=None,parser=None,**kw) Parse the html, returning a single element/document.
parse(filename_or_url,parser=None,base_url=None,**kw) Parse a filename, URL, or file-like object into an HTML document
_contains_block_level_tag(el)
_element_name(el)
submit_form(form,extra_values=None,open_http=None) Helper function to submit a form. Returns a file-like object, as from
open_http_urllib(method,url,values)
html_to_xhtml(html) Convert all tags in an HTML tree to XHTML by moving them to the
xhtml_to_html(xhtml) Convert all tags in an XHTML tree to HTML by removing their
tostring(doc,pretty_print=False,include_meta_content_type=False,encoding=None,method="html",with_tail=True,doctype=None) Return an HTML string representation of the document.
open_in_browser(doc,encoding=None) Open the HTML document in a web browser, saving it to a temporary
Element(*args,**kw) Create a new HTML Element.
__fix_docstring(s)
_unquote_match(s, pos)
_transform_result(typ, result)

Convert the result back into the input type.

_nons(tag)
class Classes(attributes)

Provides access to an element’s class attribute as a set-like collection. Usage:

>>> el = fromstring('<p class="hidden large">Text</p>')
>>> classes = el.classes  # or: classes = Classes(el.attrib)
>>> classes |= ['block', 'paragraph']
>>> el.get('class')
'hidden large block paragraph'
>>> classes.toggle('hidden')
False
>>> el.get('class')
'large block paragraph'
>>> classes -= ('some', 'classes', 'block')
>>> el.get('class')
'large paragraph'
__init__(attributes)
add(value)

Add a class.

This has no effect if the class is already present.

discard(value)

Remove a class if it is currently present.

If the class is not present, do nothing.

remove(value)

Remove a class; it must currently be present.

If the class is not present, raise a KeyError.

__contains__(name)
__iter__()
__len__()
update(values)

Add all names from ‘values’.

toggle(value)

Add a class name if it isn’t there yet, or remove it if it exists.

Returns true if the class was added (and is now enabled) and false if it was removed (and is now disabled).
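
The difference between add(), discard(), remove() and toggle() can be sketched as follows. This is a minimal illustration, assuming lxml is installed; the markup and class names are made up.

```python
from lxml import html

el = html.fragment_fromstring('<p class="note">Text</p>')
classes = el.classes

classes.add('wide')        # appended to the class attribute
classes.add('note')        # already present: no effect
classes.discard('absent')  # not present: silently ignored

try:
    classes.remove('absent')  # not present: raises KeyError
    remove_failed = False
except KeyError:
    remove_failed = True

added_first = classes.toggle('hidden')   # class was missing, so it is added
added_second = classes.toggle('hidden')  # class was present, so it is removed
class_attr = el.get('class')
```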

class HtmlMixin
set(key, value=None)

set(self, key, value=None)

Sets an element attribute. If no value is provided, or if the value is None, creates a ‘boolean’ attribute without value, e.g. “<form novalidate></form>” for form.set('novalidate').
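
A short sketch of both call styles, assuming lxml is installed and the element comes from the HTML parser (the form markup is illustrative):

```python
from lxml import html

form = html.fragment_fromstring('<form></form>')
form.set('novalidate')         # boolean attribute, no value
form.set('action', '/search')  # ordinary attribute with a value

markup = html.tostring(form, encoding='unicode')
```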

classes()

A set-like wrapper around the ‘class’ attribute.

classes(classes)
base_url()

Returns the base URL, given when the page was parsed.

Use with urllib.parse.urljoin(el.base_url, href) (urlparse.urljoin in Python 2) to get absolute URLs.
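
For illustration, a minimal sketch of resolving one link against the base URL under Python 3. The URL and markup are placeholders, not taken from the library documentation.

```python
from urllib.parse import urljoin

from lxml import html

doc = html.fromstring(
    '<p><a href="docs/intro.html">Intro</a></p>',
    base_url='http://example.com/site/',
)
link = doc.find('.//a')

# base_url is available on every element of the parsed tree.
absolute = urljoin(link.base_url, link.get('href'))
```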

forms()

Return a list of all the forms.

body()

Return the <body> element. Can be called from a child element to get the document’s body.

head()

Returns the <head> element. Can be called from a child element to get the document’s head.
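
A small sketch, assuming lxml is installed, showing that body and head can be reached from any element in the document (the markup is illustrative):

```python
from lxml import html

doc = html.document_fromstring(
    '<html><head><title>Demo</title></head><body><p>Hi</p></body></html>'
)
paragraph = doc.find('.//p')

# Both properties search the whole document, not just the subtree.
body_tag = paragraph.body.tag
title = paragraph.head.findtext('title')
```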

label()

Get or set any <label> element associated with this element.

label(label)
label()
drop_tree()

Removes this element from the tree, including its children and text. The tail text is joined to the previous element or parent.

drop_tag()

Remove the tag, but not its children or text. The children and text are merged into the parent.

Example:

>>> h = fragment_fromstring('<div>Hello <b>World!</b></div>')
>>> h.find('.//b').drop_tag()
>>> print(tostring(h, encoding='unicode'))
<div>Hello World!</div>

find_rel_links(rel)

Find any links like <a rel="{rel}">...</a>; returns a list of elements.

find_class(class_name)

Find any elements with the given class name.

get_element_by_id(id, *default)

Get the first element in a document with the given id. If none is found, return the default argument if provided or raise KeyError otherwise.

Note that there can be more than one element with the same id, and this isn’t uncommon in HTML documents found in the wild. Browsers return only the first match, and this function does the same.

text_content()

Return the text content of the tag (and the text in any children).
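
The three lookups above might be combined like this. It is a sketch on made-up markup, assuming lxml is installed.

```python
from lxml import html

doc = html.fragment_fromstring(
    '<div><p id="intro" class="lead big">Hello <b>big</b> world</p>'
    '<p class="lead">Second</p></div>'
)

leads = doc.find_class('lead')                  # matches both paragraphs
intro = doc.get_element_by_id('intro')
missing = doc.get_element_by_id('nope', None)   # default instead of KeyError
text = intro.text_content()                     # text of the tag and its children
```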

cssselect(expr, translator="html")

Run the CSS expression on this element and its children, returning a list of the results.

Equivalent to lxml.cssselect.CSSSelect(expr, translator='html')(self). Note that pre-compiling the expression can provide a substantial speedup.

make_links_absolute(base_url=None, resolve_base_href=True, handle_failures=None)

Make all links in the document absolute, given the base_url for the document (the full URL where the document came from), or if no base_url is given, then the .base_url of the document.

If resolve_base_href is true, then any <base href> tags in the document are used and removed from the document. If it is false then any such tag is ignored.

If handle_failures is None (default), a failure to process a URL will abort the processing. If set to ‘ignore’, errors are ignored. If set to ‘discard’, failing URLs will be removed.
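
A minimal sketch of making links absolute with the element method make_links_absolute(), here relying on the base URL given at parse time. The URL and markup are placeholders.

```python
from lxml import html

doc = html.fromstring(
    '<div><a href="/about">About</a> <img src="logo.png"></div>',
    base_url='http://example.com/blog/',
)
doc.make_links_absolute()  # no argument: falls back to doc.base_url

href = doc.find('.//a').get('href')
src = doc.find('.//img').get('src')
```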

resolve_base_href(handle_failures=None)

Find any <base href> tag in the document, and apply its values to all links found in the document. Also remove the tag once it has been applied.

If handle_failures is None (default), a failure to process a URL will abort the processing. If set to ‘ignore’, errors are ignored. If set to ‘discard’, failing URLs will be removed.

iterlinks()

Yield (element, attribute, link, pos), where attribute may be None (indicating the link is in the text). pos is the position where the link occurs; often 0, but sometimes something else in the case of links in stylesheets or style tags.

Note: <base href> is not taken into account in any way. The link you get is exactly the link in the document.

Note: multiple links inside of a single text string or attribute value are returned in reversed order. This makes it possible to replace or delete them from the text string value based on their reported text positions. Otherwise, a modification at one text position can change the positions of links reported later on.
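
The yielded tuples can be inspected with the element method iterlinks(), as in this sketch (illustrative markup, assuming lxml is installed):

```python
from lxml import html

doc = html.fragment_fromstring(
    '<div><a href="a.html">A</a><img src="pic.png"></div>'
)

# Collect (tag, attribute, link, pos) for every link in document order.
links = [(el.tag, attr, link, pos) for el, attr, link, pos in doc.iterlinks()]
```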

rewrite_links(link_repl_func, resolve_base_href=True, base_href=None)

Rewrite all the links in the document. For each link link_repl_func(link) will be called, and the return value will replace the old link.

Note that links may not be absolute (unless you first called make_links_absolute()), and may be internal (e.g., '#anchor'). They can also be values like 'mailto:email' or 'javascript:expr'.

If you give base_href then all links passed to link_repl_func() will take that into account.

If the link_repl_func returns None, the attribute or tag text will be removed completely.
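
A sketch of a replacement function used with the element method rewrite_links(). The helper clean_link is hypothetical: it upgrades http links, drops javascript links by returning None, and leaves everything else alone.

```python
from lxml import html


def clean_link(link):
    """Hypothetical policy: drop javascript: links, upgrade http to https."""
    if link.startswith('javascript:'):
        return None  # the attribute is removed
    if link.startswith('http://'):
        return 'https://' + link[len('http://'):]
    return link      # unchanged links are left as they are


doc = html.fragment_fromstring(
    '<div><a href="http://example.com/a">A</a>'
    '<a href="javascript:void(0)">B</a>'
    '<a href="#top">C</a></div>'
)
doc.rewrite_links(clean_link)

hrefs = [a.get('href') for a in doc.iter('a')]
```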

class _MethodFunc(name, copy=False, source_class=HtmlMixin)

An object that represents a method on an element as a function; the function takes either an element or an HTML string. It returns whatever the function normally returns, or if the function works in-place (and so returns None) it returns a serialized form of the resulting document.

__init__(name, copy=False, source_class=HtmlMixin)
__call__(doc, *args, **kw)
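
In practice this wrapper is what lets link helpers be called at module level with a plain string. The sketch below assumes the module-level function lxml.html.make_links_absolute, which lxml defines through this class; the markup and URL are illustrative.

```python
from lxml import html

# Passing a string in returns a serialized string, because the wrapped
# method works in place and returns None.
result = html.make_links_absolute('<a href="/x">x</a>', 'http://example.com/')
```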
class HtmlComment
class HtmlElement
class HtmlProcessingInstruction
class HtmlEntity
class HtmlElementClassLookup(classes=None, mixins=None)

A lookup scheme for HTML Element classes.

To create a lookup instance with different Element classes, pass a tag name mapping of Element classes in the classes keyword argument and/or a tag name mapping of Mixin classes in the mixins keyword argument. The special key ‘*’ denotes a Mixin class that should be mixed into all Element classes.

__init__(classes=None, mixins=None)
lookup(node_type, document, namespace, name)
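
A sketch of a custom lookup using the classes argument, assuming lxml is installed. The Heading class and its shout() method are hypothetical. Note that passing classes replaces the default tag mapping, so tags not listed fall back to plain HtmlElement.

```python
from lxml import etree, html


class Heading(html.HtmlElement):
    """Hypothetical element class used for <h1> tags only."""

    def shout(self):
        return self.text_content().upper()


lookup = html.HtmlElementClassLookup(classes={'h1': Heading})
parser = etree.HTMLParser()
parser.set_element_class_lookup(lookup)

doc = html.document_fromstring('<h1>Title</h1><p>Body</p>', parser=parser)
h1 = doc.find('.//h1')
p = doc.find('.//p')
```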
document_fromstring(html, parser=None, ensure_head_body=False, **kw)
fragments_fromstring(html, no_leading_text=False, base_url=None, parser=None, **kw)

Parses several HTML elements, returning a list of elements.

The first item in the list may be a string. If no_leading_text is true, then it will be an error if there is leading text, and it will always be a list of only elements.

base_url will set the document’s base_url attribute (and the tree’s docinfo.URL).
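
A brief sketch of the leading-text behaviour, on made-up input and assuming lxml is installed:

```python
from lxml import etree, html

parts = html.fragments_fromstring('Intro <b>bold</b><i>italic</i>')
leading = parts[0]                    # the leading text comes first, as a string
tags = [el.tag for el in parts[1:]]   # the remaining items are elements

try:
    html.fragments_fromstring('Intro <b>bold</b>', no_leading_text=True)
    rejected = False
except etree.ParserError:
    rejected = True                   # leading text is an error in this mode
```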

fragment_fromstring(html, create_parent=False, base_url=None, parser=None, **kw)

Parses a single HTML element; it is an error if there is more than one element, or if anything but whitespace precedes or follows the element.

If create_parent is true (or is a tag name) then a parent node will be created to encapsulate the HTML in a single element. In this case, leading or trailing text is also allowed, as are multiple elements as result of the parsing.

Passing a base_url will set the document’s base_url attribute (and the tree’s docinfo.URL).

fromstring(html, base_url=None, parser=None, **kw)

Parse the html, returning a single element/document.

This tries to minimally parse the chunk of text, without knowing if it is a fragment or a document.

base_url will set the document’s base_url attribute (and the tree’s docinfo.URL).
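
A sketch of how the return value depends on the input, assuming lxml is installed (the inputs are illustrative):

```python
from lxml import html

single = html.fromstring('<p>Hello</p>')                          # the <p> itself
full = html.fromstring('<html><body><p>Hello</p></body></html>')  # the <html> root
several = html.fromstring('<p>One</p><p>Two</p>')                 # a wrapper element

child_tags = [child.tag for child in several]
```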

parse(filename_or_url, parser=None, base_url=None, **kw)

Parse a filename, URL, or file-like object into an HTML document tree. Note: this returns a tree, not an element. Use parse(...).getroot() to get the document root.

You can override the base URL with the base_url keyword. This is most useful when parsing from a file-like object.
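
A minimal sketch of parsing from a file-like object with an explicit base URL. The in-memory buffer and the URL stand in for a real file or response.

```python
import io

from lxml import html

source = io.BytesIO(
    b'<html><head><title>T</title></head>'
    b'<body><a href="x.html">x</a></body></html>'
)
tree = html.parse(source, base_url='http://example.com/dir/')

root = tree.getroot()  # parse() returns a tree, so fetch the root element
```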

_contains_block_level_tag(el)
_element_name(el)
class FormElement

Represents a <form> element.

inputs()

Returns an accessor for all the input elements in the form.

See InputGetter for more information about the object.

fields()

Dictionary-like object that represents all the fields in this form. You can set values in this dictionary to affect the form.

fields(value)
_name()
form_values()

Return a list of tuples of the field values for the form. This is suitable to be passed to urllib.parse.urlencode() (urllib.urlencode() in Python 2).
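
A sketch combining fields and form_values() on an illustrative form, assuming lxml is installed. Submit buttons and unchecked boxes are left out of the result.

```python
from urllib.parse import urlencode

from lxml import html

form = html.fragment_fromstring(
    '<form action="/s" method="post">'
    '<input type="text" name="q" value="lxml">'
    '<input type="checkbox" name="exact" checked>'
    '<input type="submit" name="go" value="Go">'
    '</form>'
)

form.fields['q'] = 'python'   # same as form.inputs['q'].value = 'python'
values = form.form_values()   # list of (name, value) tuples
query = urlencode(values)
```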

action()

Get/set the form’s action attribute.

action(value)
action()
method()

Get/set the form’s method. Always returns an upper-case string, and defaults to 'GET'.

method(value)
submit_form(form, extra_values=None, open_http=None)

Helper function to submit a form. Returns a file-like object, as from urllib.request.urlopen() (urllib.urlopen() in Python 2). This object also has a .geturl() function, which shows the URL if there were any redirects.

You can use this like:

form = doc.forms[0]
form.inputs['foo'].value = 'bar' # etc
response = submit_form(form)
doc = parse(response)
doc.make_links_absolute(response.geturl())

To change the HTTP requester, pass a function as open_http keyword argument that opens the URL for you. The function must have the following signature:

open_http(method, URL, values)

The method is one of ‘GET’ or ‘POST’, the URL is the target URL as a string, and the values are a sequence of (name, value) tuples with the form data.
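
A sketch of a custom requester that never touches the network. fake_open_http is a hypothetical stand-in that records its arguments and returns a canned response; the form and URL are illustrative.

```python
import io

from lxml import html

captured = {}


def fake_open_http(method, url, values):
    """Hypothetical requester: record the call, return a canned page."""
    captured.update(method=method, url=url, values=list(values))
    return io.BytesIO(b'<html><body><p>ok</p></body></html>')


form = html.fromstring(
    '<form action="/search" method="post"><input name="q" value="lxml"></form>',
    base_url='http://example.com/',
)
response = html.submit_form(form, extra_values={'page': '2'},
                            open_http=fake_open_http)
result = html.parse(response).getroot()
```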

open_http_urllib(method, url, values)
class FieldsDict(inputs)
__init__(inputs)
__getitem__(item)
__setitem__(item, value)
__delitem__(item)
keys()
__contains__(item)
__iter__()
__len__()
__repr__()
class InputGetter(form)

An accessor that represents all the input fields in a form.

You can get fields by name from this, with form.inputs['field_name']. If there is a set of checkboxes with the same name, they are returned as a list (a CheckboxGroup which also allows value setting). Radio inputs are handled similarly.

You can also iterate over this to get all input elements. This won’t return the same thing as if you get all the names, as checkboxes and radio elements are returned individually.

__init__(form)
__repr__()
__getitem__(name)
__contains__(name)
keys()
__iter__()
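
The difference between lookup by name and iteration can be sketched like this (illustrative form, assuming lxml is installed):

```python
from lxml import html

form = html.fragment_fromstring(
    '<form><input name="user" value="ann">'
    '<input type="checkbox" name="tag" value="a">'
    '<input type="checkbox" name="tag" value="b"></form>'
)
inputs = form.inputs

user = inputs['user']                      # a single input element
tags = inputs['tag']                       # a CheckboxGroup with two members
names = sorted(inputs.keys())              # each name once
individual = [el.name for el in inputs]    # every element individually
```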
class InputMixin

Mix-in for all input elements (input, select, and textarea)

name()

Get/set the name of the element

name(value)
name()
__repr__()
class TextareaElement

<textarea> element. You can get the name with .name and get/set the value with .value.

value()

Get/set the value (which is the contents of this element)

value(value)
value()
class SelectElement

<select> element. You can get the name with .name.

.value will be the value of the selected option, unless this is a multi-select element (<select multiple>), in which case it will be a set-like object. In either case .value_options gives the possible values.

The boolean attribute .multiple shows if this is a multi-select.

value()

Get/set the value of this select (the selected option).

If this is a multi-select, this is a set-like object that represents all the selected options.

value(value)
value()
value_options()

All the possible values this select can have (the value attribute of all the <option> elements).

multiple()

Boolean attribute: is there a multiple attribute on this element.

multiple(value)
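
A sketch of a single-value select, on illustrative markup and assuming lxml is installed:

```python
from lxml import html

select = html.fragment_fromstring(
    '<select name="colour">'
    '<option value="r">Red</option>'
    '<option value="g" selected>Green</option>'
    '</select>'
)

before = select.value            # value of the selected option
options = select.value_options   # every possible value
select.value = 'r'               # moves the selected attribute
after = select.value
```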
class MultipleSelectOptions(select)

Represents all the selected options in a <select multiple> element.

You can add to this set-like object to select an option, or remove from it to unselect the option.

__init__(select)
options()

Iterator of all the <option> elements.

__iter__()
add(item)
remove(item)
__repr__()
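
For a multi-select, the set-like object is reached through .value. A minimal sketch on illustrative markup:

```python
from lxml import html

select = html.fragment_fromstring(
    '<select name="colour" multiple>'
    '<option value="r" selected>Red</option>'
    '<option value="g">Green</option>'
    '</select>'
)

chosen = select.value   # MultipleSelectOptions
chosen.add('g')         # selects the "g" option
chosen.remove('r')      # unselects the "r" option
selected = list(chosen)
```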
class RadioGroup

This object represents several <input type=radio> elements that have the same name.

You can use this like a list, but also use the property .value to check/uncheck inputs. Also you can use .value_options to get the possible values.

value()

Get/set the value, which checks the radio with that value (and unchecks any other value).

value(value)
value()
value_options()

Returns a list of all the possible values.

__repr__()
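
A sketch of reading and changing a radio group through form.inputs (illustrative form, assuming lxml is installed):

```python
from lxml import html

form = html.fragment_fromstring(
    '<form><input type="radio" name="size" value="s" checked>'
    '<input type="radio" name="size" value="l"></form>'
)
group = form.inputs['size']   # RadioGroup

before = group.value
options = group.value_options
group.value = 'l'             # checks "l" and unchecks the rest
checked = [el.checked for el in group]
```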
class CheckboxGroup

Represents a group of checkboxes (<input type=checkbox>) that have the same name.

In addition to using this like a list, the .value attribute returns a set-like object that you can add to or remove from to check and uncheck checkboxes. You can also use .value_options to get the possible values.

value()

Return a set-like object that can be modified to check or uncheck individual checkboxes according to their value.

value(value)
value()
value_options()

Returns a list of all the possible values.

__repr__()
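
The set-like .value of a checkbox group might be used as follows (illustrative form, assuming lxml is installed):

```python
from lxml import html

form = html.fragment_fromstring(
    '<form><input type="checkbox" name="tag" value="a" checked>'
    '<input type="checkbox" name="tag" value="b"></form>'
)
group = form.inputs['tag']   # CheckboxGroup

values = group.value         # CheckboxValues, a live view of the checked boxes
values.add('b')              # checks the box with value "b"
values.remove('a')           # unchecks the box with value "a"
checked = [el.checked for el in group]
```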
class CheckboxValues(group)

Represents the values of the checked checkboxes in a group of checkboxes with the same name.

__init__(group)
__iter__()
add(value)
remove(value)
__repr__()
class InputElement

Represents an <input> element.

You can get the type with .type (which is lower-cased and defaults to 'text').

Also you can get and set the value with .value.

Checkboxes and radios have the attribute input.checkable == True (for all others it is false) and a boolean attribute .checked.

value()

Get/set the value of this element, using the value attribute.

Also, if this is a checkbox and it has no value, this defaults to 'on'. If it is a checkbox or radio that is not checked, this returns None.

value(value)
value()
type()

Return the type of this element (using the type attribute).

type(value)
checkable()

Boolean: can this element be checked?

checked()

Boolean attribute to get/set the presence of the checked attribute.

You can only use this on checkable input types.

checked(value)
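
The value rules for checkable inputs can be sketched like this (illustrative form, assuming lxml is installed):

```python
from lxml import html

form = html.fragment_fromstring(
    '<form><input type="checkbox" name="a" checked>'
    '<input type="checkbox" name="b" value="1">'
    '<input name="c" value="text"></form>'
)
a, b, c = form.inputs['a'], form.inputs['b'], form.inputs['c']

a_value = a.value          # checked, no value attribute: defaults to 'on'
b_before = b.value         # not checked: None
b.checked = True
b_after = b.value          # now returns the value attribute
c_info = (c.type, c.checkable, c.value)
```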
class LabelElement

Represents a <label> element.

Label elements are linked to other elements with their for attribute. You can access this element with label.for_element.

for_element()

Get/set the element this label points to. Return None if it can’t be found.

for_element(other)
for_element()
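
A sketch of navigating between a label and its field in both directions, on illustrative markup and assuming lxml is installed:

```python
from lxml import html

doc = html.document_fromstring(
    '<form><label for="email">Email</label>'
    '<input id="email" name="email"></form>'
)
label = doc.find('.//label')
field = doc.get_element_by_id('email')

target_id = label.for_element.get('id')   # label -> field via the for attribute
label_text = field.label.text             # field -> label via HtmlMixin.label
```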
html_to_xhtml(html)

Convert all tags in an HTML tree to XHTML by moving them to the XHTML namespace.

xhtml_to_html(xhtml)

Convert all tags in an XHTML tree to HTML by removing their XHTML namespace.
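
Both conversions work in place, as this sketch on an illustrative document shows (assuming lxml is installed):

```python
from lxml import html

doc = html.document_fromstring('<p>Hi</p>')

html.html_to_xhtml(doc)   # tags move into the XHTML namespace
xhtml_tag = doc.tag

html.xhtml_to_html(doc)   # the namespace is stripped again
html_tag = doc.tag
```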

tostring(doc, pretty_print=False, include_meta_content_type=False, encoding=None, method="html", with_tail=True, doctype=None)

Return an HTML string representation of the document.

Note: if include_meta_content_type is true this will create a <meta http-equiv="Content-Type" ...> tag in the head; regardless of the value of include_meta_content_type any existing <meta http-equiv="Content-Type" ...> tag will be removed.

The encoding argument controls the output encoding (defaults to ASCII, with &#…; character references for any characters outside of ASCII). Note that you can pass the name 'unicode' as encoding argument to serialise to a Unicode string.

The method argument defines the output method. It defaults to ‘html’, but can also be ‘xml’ for xhtml output, or ‘text’ to serialise to plain text without markup.

To leave out the tail text of the top-level element that is being serialised, pass with_tail=False.

The doctype option allows passing in a plain string that will be serialised before the XML tree. Note that passing in non well-formed content here will make the XML output non well-formed. Also, an existing doctype in the document tree will not be removed when serialising an ElementTree instance.

Example:

>>> from lxml import html
>>> root = html.fragment_fromstring('<p>Hello<br>world!</p>')

>>> html.tostring(root)
b'<p>Hello<br>world!</p>'
>>> html.tostring(root, method='html')
b'<p>Hello<br>world!</p>'

>>> html.tostring(root, method='xml')
b'<p>Hello<br/>world!</p>'

>>> html.tostring(root, method='text')
b'Helloworld!'

>>> html.tostring(root, method='text', encoding='unicode')
'Helloworld!'

>>> root = html.fragment_fromstring('<div><p>Hello<br>world!</p>TAIL</div>')
>>> html.tostring(root[0], method='text', encoding='unicode')
'Helloworld!TAIL'

>>> html.tostring(root[0], method='text', encoding='unicode', with_tail=False)
'Helloworld!'

>>> doc = html.document_fromstring('<p>Hello<br>world!</p>')
>>> html.tostring(doc, method='html', encoding='unicode')
'<html><body><p>Hello<br>world!</p></body></html>'

>>> print(html.tostring(doc, method='html', encoding='unicode',
...          doctype='<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"'
...                  ' "http://www.w3.org/TR/html4/strict.dtd">'))
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html><body><p>Hello<br>world!</p></body></html>
open_in_browser(doc, encoding=None)

Open the HTML document in a web browser, saving it to a temporary file to open it. Note that this does not delete the file after use. This is mainly meant for debugging.

class HTMLParser(**kwargs)

An HTML parser that is configured to return lxml.html Element objects.

__init__(**kwargs)
class XHTMLParser(**kwargs)

An XML parser that is configured to return lxml.html Element objects.

Note that this parser is not really XHTML aware unless you let it load a DTD that declares the HTML entities. To do this, make sure you have the XHTML DTDs installed in your catalogs, and create the parser like this:

>>> parser = XHTMLParser(load_dtd=True)

If you additionally want to validate the document, use this:

>>> parser = XHTMLParser(dtd_validation=True)

For catalog support, see http://www.xmlsoft.org/catalog.html.

__init__(**kwargs)
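
A sketch of parsing well-formed XHTML with this parser, without loading a DTD. The document string is illustrative, and the input must be well-formed XML.

```python
from lxml import etree, html

XHTML_NS = 'http://www.w3.org/1999/xhtml'

parser = html.XHTMLParser()
root = etree.fromstring(
    '<html xmlns="%s"><body><p>Hi</p></body></html>' % XHTML_NS,
    parser,
)

# Elements are lxml.html objects, but their tags stay namespaced.
paragraph = root.find('.//{%s}p' % XHTML_NS)
text = paragraph.text_content()
```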
Element(*args, **kw)

Create a new HTML Element.

This can also be used for XHTML documents.