Reference¶

This is the reference for the Wextracto web data extraction package.

Command¶

The wex command extracts data from HTTP-like responses. These responses can come from files, directories or URLs specified on the command line. The command calls any extractors that have been registered and writes any data extracted as output.

The output and input can be saved using the --save or --save-dir command line arguments. This is useful for regression testing existing extractor functions. The tests are run using py.test.

For the complete list of command line arguments run:

$ wex --help

Registering Extractors¶

The simplest way to register extractors is to have a file named entry_points.txt in the current directory. This file should look something like this:

[wex]
.example.net = mymodule:extract_from_example_net

The [wex] section heading tells Wextracto that the following lines register extractors.

Extractors are registered using name = value pairs. If the name starts with . then the extractor is only applied to responses from domain names that match that name. Our example would match responses from www.example.net or example.net.

If the name does not start with . it will be applied to responses whatever their domain.

You can register the same extractor against multiple domain names by having multiple lines with the same value but different names.
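
The registration rules above can be sketched in a few lines of standard-library Python. This is only an illustration of the file format and the matching rule as described here, not Wextracto's own loader. The helpers applies_to and extractors_for, the second domain, the name anywhere and the value mymodule:extract_everywhere are all hypothetical.

```python
from configparser import ConfigParser

# Example file content. Only the first entry comes from the text above;
# the other two are hypothetical additions for illustration.
ENTRY_POINTS_TXT = """
[wex]
.example.net = mymodule:extract_from_example_net
.example.org = mymodule:extract_from_example_net
anywhere = mymodule:extract_everywhere
"""


def applies_to(name, hostname):
    """Return True if an extractor registered under name applies to hostname."""
    if not name.startswith("."):
        # Names without a leading dot apply to every domain.
        return True
    # ".example.net" matches "example.net" and any subdomain of it.
    return hostname == name[1:] or hostname.endswith(name)


# Only "=" is treated as a delimiter so the ":" in the value is left alone.
parser = ConfigParser(delimiters=("=",))
parser.read_string(ENTRY_POINTS_TXT)
registered = dict(parser["wex"])


def extractors_for(hostname):
    """Return the sorted, de-duplicated values whose names apply to hostname."""
    return sorted({value for name, value in registered.items()
                   if applies_to(name, hostname)})
```

Here the same value is registered under two names, so extractors_for("www.example.org") and extractors_for("example.net") both include mymodule:extract_from_example_net.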

This is exactly the same format and content that you would use in the entry_points parameter for a setup function, if and when you want to package and distribute your extractor functions.

Extractor¶

An extractor is a callable that returns or yields data. For example:

def extract(response):
    return "something"

The response parameter here is an instance of wex.response.Response.

Extractors can be combined in various ways.

class wex.extractor.Named(**kw)[source]¶

An extractor that is a collection of named extractors.

Extractors can be added to the collection on construction using keyword arguments for the names or they can be added using add().

The names are labels in the output produced. For example, an extractor function extract defined as follows:

extract = Named(
    name1 = (lambda response: "one"),
    name2 = (lambda response: "two"),
)

Would produce the extraction output something like this:

$ wex http://example.net/
"name1"    "one"
"name2"    "two"

The ordering of sub-extractor output is arbitrary.

add(extractor, label=None)[source]¶

Add an attribute extractor.

Parameters:

extractor (callable) – The extractor to be added.
label (str) – The label for the extractor. This may be `None`, in which case the extractor's `__name__` attribute will be used.

This method returns the extractor added. This means it can also be used as a decorator. For example:

attrs = Named()

@attrs.add
def attr1(response):
    return "one"

wex.extractor.chained(*extractors)[source]¶

Returns an extractor that chains the output of other extractors.

The output is the output from each extractor in sequence.

Parameters:	extractors – an iterable of extractor callables to chain

For example an extractor function extract defined as follows:

def extract1(response):
    yield "one"

def extract2(response):
    yield "two"

extract = chained(extract1, extract2)

Would produce the following extraction output:

$ wex http://example.net/
"one"
"two"

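The chaining behaviour described above can be imitated without Wextracto installed. The following is a minimal stand-in, assuming every extractor yields its values; chained_sketch is a hypothetical name and this is not the library's implementation.

```python
def chained_sketch(*extractors):
    """Return an extractor that yields the output of each extractor in turn."""
    def extract(response):
        for extractor in extractors:
            for value in extractor(response):
                yield value
    return extract


def extract1(response):
    yield "one"


def extract2(response):
    yield "two"


extract = chained_sketch(extract1, extract2)

# These extractors ignore their argument, so None stands in for a response.
values = list(extract(None))
```
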
wex.extractor.labelled(*args)[source]¶

Returns an extractor decorator that will label the output of an extractor.

Parameters:	literals_or_callables – An iterable of labels or callables.

Each item in literals_or_callables may be a literal or a callable. Any callable will be called with the same parameters as the extractor and whatever is returned will be used as a label.

For example an extractor function extract defined as follows:

def extract1(response):
    yield "one"

def label2(response):
    return "label2"

extract = labelled("label1", label2)(extract1)

Would produce the following extraction output:

$ wex http://example.net/
"label1"    "label2"    "one"

Note that if any of the labels are false then no output will be generated from that extractor.
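
To make the labelling rules concrete, here is a rough stand-in written with plain Python. It assumes labelled values can be modelled as tuples of labels followed by the value; labelled_sketch is a hypothetical name, and the real decorator may differ in how it represents output.

```python
def labelled_sketch(*literals_or_callables):
    """Return a decorator that prefixes each extracted value with labels."""
    def decorator(extractor):
        def extract(response):
            # Callables are called with the response; literals are used as-is.
            labels = tuple(
                item(response) if callable(item) else item
                for item in literals_or_callables
            )
            # Any false label suppresses all output from this extractor.
            if not all(labels):
                return
            for value in extractor(response):
                yield labels + (value,)
        return extract
    return decorator


def extract1(response):
    yield "one"


def label2(response):
    return "label2"


labelled_output = list(labelled_sketch("label1", label2)(extract1)(None))
suppressed_output = list(labelled_sketch("label1", "")(extract1)(None))
```

In this sketch labelled_output holds a single tuple of two labels and the value, while suppressed_output is empty because one label is an empty string.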

wex.extractor.named(**kw)[source]¶: Returns a Named collection of extractors.

Element Tree¶

Composable functions for extracting data using lxml.

wex.etree.base_url_pair_getter(get_url)[source]¶

Returns a function for getting a tuple of (base_url, url) when called with an etree Element or ElementTree.

In the returned pair, base_url is the value returned by get_base_url on the etree Element or ElementTree. The second value is the value returned by calling get_url on the same Element or ElementTree, joined to base_url using urljoin. This allows get_url to return a relative URL.
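
The joining step can be illustrated with the standard library's urljoin alone. The helper url_pair and the example URLs below are hypothetical; the sketch shows only how a relative URL is resolved against a base URL, not how the base URL is found in a document.

```python
from urllib.parse import urljoin


def url_pair(base_url, url):
    """Return (base_url, url) with url resolved against base_url."""
    return (base_url, urljoin(base_url, url))


base = "http://example.net/section/page.html"

relative = url_pair(base, "item.html")
rooted = url_pair(base, "/top")
absolute = url_pair(base, "http://other.example.org/x")
```

A relative path is resolved against the directory of the base URL, a path starting with / is resolved against its host, and an absolute URL is returned unchanged.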

wex.etree.css(expression)[source]¶

Returns a composable callable that will select elements defined by a CSS selector expression.

Parameters:	expression – The CSS selector expression.

The callable returned accepts a wex.response.Response, a list of elements or an individual element as an argument.

wex.etree.drop_tree(*selectors)[source]¶: Return a function that will remove trees selected by selectors.

wex.etree.href_any_url¶: A wex.composed.ComposedFunction that returns the absolute URL from an href attribute.

wex.etree.href_url¶: A wex.composed.ComposedFunction that returns the absolute URL from an href attribute as long as it is from the same domain as the base URL of the response.

wex.etree.href_url_same_suffix¶: A wex.composed.ComposedFunction that returns the absolute URL from an href attribute as long as it is from the same public suffix as the base URL of the response.

wex.etree.itertext(*tags, **kw)[source]¶: Return a function that will return an iterator for text.

wex.etree.same_domain(url_pair)[source]¶: Return second url of pair if both are from same domain.

wex.etree.same_suffix(url_pair)[source]¶: Return second url of pair if both have the same public suffix.

wex.etree.src_url¶: A wex.composed.ComposedFunction that returns the absolute URL from an src attribute.

wex.etree.text¶: Alias for normalize-space | list2set

wex.etree.text_content¶: Return text content from an object (typically node-set) excluding content from within <script> or <style> elements.

wex.etree.xpath(expression, namespaces={u're': u'http://exslt.org/regular-expressions'})[source]¶

Returns a composable callable that will select elements defined by an XPath expression.

Parameters:

expression – The XPath expression.
namespaces – The namespaces.

The callable returned accepts a wex.response.Response, a list of elements or an individual element as an argument.

For example:

>>> from lxml.html import fromstring
>>> from wex.etree import xpath
>>> tree = fromstring('<h1>Hello</h1>')
>>> selector = xpath('//h1')

Regular Expressions¶

wex.regex.re_group(pattern, group=1, flags=0)[source]¶

Returns a composable callable that extracts the specified group using a regular expression.

Parameters:

pattern – The regular expression.
group – The group from the MatchObject.
flags – Flags to use when compiling the pattern.

wex.regex.re_groupdict(pattern, flags=0)[source]¶

Returns a composable callable that extracts a group dictionary using a regular expression.

Parameters:

pattern – The regular expression.
flags – Flags to use when compiling the pattern.
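
The two functions above can be approximated with the standard re module. This sketch assumes the callable receives a string and looks only at the first match. Whether the real functions use the first match or every match is not stated here, and the _sketch names are hypothetical.

```python
import re


def re_group_sketch(pattern, group=1, flags=0):
    """Return a callable that extracts one group from the first match."""
    compiled = re.compile(pattern, flags)

    def extract(text):
        match = compiled.search(text)
        return match.group(group) if match else None
    return extract


def re_groupdict_sketch(pattern, flags=0):
    """Return a callable that extracts the named groups of the first match."""
    compiled = re.compile(pattern, flags)

    def extract(text):
        match = compiled.search(text)
        return match.groupdict() if match else None
    return extract


get_price = re_group_sketch(r"price: (\d+\.\d+)")
get_parts = re_groupdict_sketch(
    r"(?P<currency>[a-z]{3}) (?P<amount>\d+)", re.IGNORECASE)
```
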

String Functions¶

wex.string.partition(separator, **kw)[source]¶: Returns a function that yields tuples created by partitioning text using separator.

Iterables¶

Helper functions for things that are iterable

exception wex.iterable.MultipleValuesError[source]¶: More than one value was found when one or none were expected.

exception wex.iterable.ZeroValuesError[source]¶: Zero values were found when at least one was expected.

wex.iterable.islice(*islice_args)[source]¶: Returns a function that will perform itertools.islice on its input.

wex.iterable.one(iterable)[source]¶

Returns an item from an iterable of exactly one element.

If the iterable comprises zero elements then ZeroValuesError is raised. If the iterable has more than one element then MultipleValuesError is raised.

wex.iterable.one_or_none(iterable)[source]¶

Returns one item or None from an iterable of length one or zero.

If the iterable is empty then None is returned.

If the iterable has more than one element then MultipleValuesError is raised.

wex.iterable.first(iterable)[source]¶

Returns first item from an iterable.

Parameters:	iterable – The iterable.

If the iterable is empty then None is returned.
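
The behaviour of one, one_or_none and first as documented above can be summarised in a short stand-in. The exception classes and function names ending in Sketch or _sketch are hypothetical substitutes for the ones in wex.iterable, and unlike a careful implementation this sketch consumes the whole iterable before deciding.

```python
class ZeroValuesSketchError(ValueError):
    """Stand-in for wex.iterable.ZeroValuesError."""


class MultipleValuesSketchError(ValueError):
    """Stand-in for wex.iterable.MultipleValuesError."""


def one_sketch(iterable):
    """Return the only item, raising if there are zero or several."""
    items = list(iterable)
    if not items:
        raise ZeroValuesSketchError()
    if len(items) > 1:
        raise MultipleValuesSketchError()
    return items[0]


def one_or_none_sketch(iterable):
    """Return the only item or None, raising if there are several."""
    items = list(iterable)
    if len(items) > 1:
        raise MultipleValuesSketchError()
    return items[0] if items else None


def first_sketch(iterable):
    """Return the first item, or None if the iterable is empty."""
    return next(iter(iterable), None)
```
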

wex.iterable.flatten(iterable, yield_types)[source]¶: Yield objects from all sub-iterables of iterable.

URLs¶

class wex.url.Method(scheme, name, args=None)[source]¶

Method objects ‘get’ responses from a url.

The Method object looks up the correct implementation based on its name and the scheme of the url.

The default method name is ‘get’. Other method names can be specified in the fragment of the url.

get(url, **kw)[source]¶: Get responses for ‘url’.

class wex.url.URL[source]¶

URL objects.

fragment_dict¶: Client side data dict represented as JSON in the fragment.

get(**kw)[source]¶: Get url using the appropriate Method.

method¶: The Method for this URL.

Other Methods¶

class wex.form.ParserReadable(readable)[source]¶: Readable that feeds a parser as it is read.

wex.form.form_values(self)[source]¶: Return a list of tuples of the field values for the form. This is suitable to be passed to urllib.urlencode().
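
As a small illustration of why a list of tuples is convenient, the following encodes one with the standard library. In Python 3 the function lives at urllib.parse.urlencode rather than urllib.urlencode. The field names and values here are made up and do not come from a real form.

```python
from urllib.parse import urlencode

# Hypothetical field values in the (name, value) shape described above.
values = [("q", "web data"), ("page", "2")]

# A list of tuples keeps the field order and allows repeated names.
body = urlencode(values)
```
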

class wex.ftp.RETRReadable(ftp, basename)[source]¶: Just like ftplib.FTP.retrbinary, but implements read and readline.

wex.ftp.close_on_empty(unbound)[source]¶

Calls ‘close’ on the first argument when the method returns something falsey.

The first argument is presumed to be self.

wex.ftp.format_header()¶

S.format(*args, **kwargs) -> unicode

Return a formatted version of S, using substitutions from args and kwargs. The substitutions are identified by braces (‘{‘ and ‘}’).

wex.ftp.format_status_line()¶

S.format(*args, **kwargs) -> unicode

Return a formatted version of S, using substitutions from args and kwargs. The substitutions are identified by braces (‘{‘ and ‘}’).

wex.ftp.get(url, recipe, **kw)[source]¶: Recipe for an FTP get.

Functions for getting responses for HTTP urls.

wex.http.format_header()¶

S.format(*args, **kwargs) -> unicode

Return a formatted version of S, using substitutions from args and kwargs. The substitutions are identified by braces (‘{‘ and ‘}’).

wex.http.format_status_line()¶

S.format(*args, **kwargs) -> unicode

Return a formatted version of S, using substitutions from args and kwargs. The substitutions are identified by braces (‘{‘ and ‘}’).

wex.http.readable_from_response(response, url, decode_content, context)[source]¶: Make an object that is readable by Response.from_file.

wex.http.request(url, method, session=None, **kw)[source]¶: Makes an HTTP request following redirects.

Sitemaps¶

Extractors for URLs from /robots.txt and sitemaps.

wex.sitemaps.urls_from_robots_txt(response)[source]¶: Yields sitemap URLs from “/robots.txt”

wex.sitemaps.urls_from_sitemaps¶: Extractor that combines urls_from_robots_txt() and urls_from_urlset_or_sitemapindex().

wex.sitemaps.urls_from_urlset_or_sitemapindex(response)[source]¶: Yields URLs from <urlset> or <sitemapindex> elements as per sitemaps.org.

Response¶

class wex.response.Response(content, headers, url, code=None, **kw)[source]¶

A urllib2 style Response with some extras.

Parameters:

content – A file-like object containing the response content.
headers – An HTTPMessage containing the response headers.
url – The URL for which this is the response.
code – The status code received with this response.
protocol – The protocol received with this response.
version – The protocol version received with this response.
reason – The reason received with this response.
request_url – The URL requested that led to this response.

seek(offset=0, whence=0)[source]¶

Seek the content file position.

Parameters:

offset (int) – The offset from whence.
whence (int) – 0=from start, 1=from current position, 2=from end
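
The three whence values follow the usual file-object convention, which can be tried out on an in-memory file. This sketch uses io.BytesIO as a stand-in for the response content; it does not construct a Response.

```python
import io

content = io.BytesIO(b"<h1>Hello</h1>")

content.read()               # consume everything, as a parser might
content.seek(0, 0)           # whence=0: back to the start
opening_tag = content.read(4)

content.seek(1, 1)           # whence=1: skip one byte from the current position
middle = content.read(4)

content.seek(-5, 2)          # whence=2: five bytes before the end
closing_tag = content.read()
```
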

Composed¶

Wextracto uses function composition as an easy way to build new functions from existing ones:

>>> from wex.composed import compose
>>> def add1(x):
...     return x + 1
...
>>> def mult2(x):
...     return x * 2
...
>>> f = compose(add1, mult2)
>>> f(2)
6

Wextracto uses the pipe operator, |, as a shorthand for function composition.

This shorthand can be a powerful technique for reducing boilerplate code when used in combination with named() extractors:

from wex.etree import css, text
from wex.extractor import named

attrs = named(title = css('h1') | text,
              description = css('#description') | text)

class wex.composed.ComposedCallable(*functions)[source]¶

A callable, taking one argument, composed from other callables.

def mult2(x):
    return x * 2

def add1(x):
    return x + 1

composed = ComposedCallable(add1, mult2)

for x in (1, 2, 3):
    assert composed(x) == mult2(add1(x))

ComposedCallable objects are composable. They can be composed of other ComposedCallable objects.

wex.composed.composable(func)[source]¶

Decorates a callable to support function composition using |.

For example:

from wex.composed import composable

@composable
def add1(x):
    return x + 1

def mult2(x):
    return x * 2

composed = add1 | mult2
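
For readers curious how a | shorthand like this can work, here is one possible way to build it using Python's operator methods. It is a sketch under the assumption that composition applies functions left to right. PipeSketch is a hypothetical class and not the code in wex.composed.

```python
class PipeSketch(object):
    """Wrap callables so that | chains them left to right."""

    def __init__(self, *functions):
        self.functions = functions

    def __call__(self, arg):
        # Feed the result of each function into the next one.
        for function in self.functions:
            arg = function(arg)
        return arg

    def __or__(self, other):
        # Handles: wrapped | anything
        return PipeSketch(self, other)

    def __ror__(self, other):
        # Handles: plain_function | wrapped
        return PipeSketch(other, self)


@PipeSketch
def add1(x):
    return x + 1


def mult2(x):
    return x * 2


composed = add1 | mult2       # mult2(add1(x))
reversed_order = mult2 | add1  # add1(mult2(x)), via __ror__
```

Only one side of the | needs to be wrapped, which is why a plain function such as mult2 can appear on either side.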

wex.composed.compose(*functions)[source]¶: Create a ComposedCallable from zero or more functions.

Output¶

Extracted data values are represented with tab-separated fields. The right-most field on each line is the value, all preceding fields are labels that describe the value. The labels and the value are all JSON encoded.

So for example, a value 9.99 with the labels product and price would look like:

"product"   "price" 9.99

And we could decode this line with the following Python snippet:

>>> import json
>>> line = '"product"\t"price"\t9.99\n'
>>> [json.loads(s) for s in line.split('\t')]
[u'product', u'price', 9.99]

Using tab-delimiters is convenient for downstream processing using Unix command line tools such as cut and grep.
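
Going the other way, a line in this format can be produced with json.dumps. The helper encode_line is hypothetical and only mirrors the format described above; it is not the writer used by the wex command.

```python
import json


def encode_line(labels, value):
    """Return one output line: JSON-encoded labels, then the value, tab-separated."""
    fields = list(labels) + [value]
    return "\t".join(json.dumps(field) for field in fields) + "\n"


line = encode_line(["product", "price"], 9.99)

# Decoding the line again gives back the original labels and value.
decoded = [json.loads(field) for field in line.split("\t")]
```
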

URL Labelling¶

The convention for Wextracto is that any URL that should be downloaded has the left-most label url. For example:

"url"   "http://example.net/some/url"

Data Labelling¶

If you are extracting multiple types of data (for example people and addresses) then a good labelling scheme is important.

It is a good idea to label the extracted values so that you can sort them easily using the Unix sort command.

An example of a labelling scheme that allows this would be:

{type}      {identifier}    {attribute}     {value}

So we might end up with output that looks like this:

"person"    "http://example.net/person/1"   "name"  "Tom Bombadil"
"person"    "http://example.net/person/1"   "email" "tom1@example.net"
"address"   "http://example.net/address/2"  "city"  "New York"
"address"   "http://example.net/address/2"  "postal code"   "10001"
"person"    "http://example.net/person/3"   "name"  "Jack Sprat"
"person"    "http://example.net/person/3"   "email" "jack3@example.net"
"address"   "http://example.net/address/4"  "city"  "London"
"address"   "http://example.net/address/4"  "postal code"   "E14 5AB"

With output like this we can easily sort and group it.
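
The same sorting and grouping can also be done in Python. The sketch below assumes the four-field scheme above and uses a few of the sample lines in shuffled order; the variable names are illustrative only.

```python
import json
from itertools import groupby

lines = [
    '"person"\t"http://example.net/person/1"\t"name"\t"Tom Bombadil"',
    '"address"\t"http://example.net/address/2"\t"city"\t"New York"',
    '"person"\t"http://example.net/person/1"\t"email"\t"tom1@example.net"',
    '"address"\t"http://example.net/address/2"\t"postal code"\t"10001"',
]

# Decode every field, then sort so rows for the same item are adjacent.
rows = sorted([json.loads(field) for field in line.split("\t")]
              for line in lines)

# Group by (type, identifier) and collect each item's attributes.
records = {}
for key, group in groupby(rows, key=lambda row: (row[0], row[1])):
    records[key] = {attribute: value for _, _, attribute, value in group}
```

Each key of records is a (type, identifier) pair and each value is a dictionary of that item's attributes.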

Regression Tests¶

When maintaining extractors it can be helpful to have some sample input and output so that regression testing can be performed when we need to change the extractors.

Wextracto supports this with the --save or --save-dir options to the wex command. Either option saves both the input and output to a local directory.

This input and output can then be used for comparison with the current extractor output.

To compare current output against saved output, run py.test like so:

$ py.test saved/