Reference¶
This is the reference for the Wextracto web data extraction package.
Command¶
The wex command extracts data from HTTP-like responses.
These responses can come from files, directories or URLs specified on the
command line. The command calls any extractors that have
been registered and writes any data extracted
as output.
The output and input can be saved, using the --save or --save-dir
command line arguments. This is useful for
regression testing existing extractor functions.
The tests are run using py.test.
For the complete list of command line arguments run:
$ wex --help
Registering Extractors¶
The simplest way to register extractors is to have
a file named entry_points.txt in the current directory. This file should
look something like this:
[wex]
.example.net = mymodule:extract_from_example_net
The [wex] section heading tells Wextracto that
the following lines register extractors.
Extractors are registered using name = value pairs.
If the name starts with . then the extractor is only applied to
responses from
domain names
that match that name.
Our example would match responses from www.example.net or example.net.
If the name does not start with . it will be applied to responses whatever
their domain.
You can register the same extractor against multiple domain names by having multiple lines with the same value but different names.
This is exactly the same format and content that you would use in the
entry_points parameter for a
setup function,
if and when you want to package and distribute your extractor functions.
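The registration format and the domain-matching rule described above can be sketched with the standard library alone. This is an illustration, not Wextracto's own loader: the `everywhere` entry, `extract_from_anything` and the helper names `registered_names` and `applies_to` are hypothetical.

```python
import configparser

# Same format as entry_points.txt; the second entry is a hypothetical
# extractor registered without a leading dot.
ENTRY_POINTS = """\
[wex]
.example.net = mymodule:extract_from_example_net
everywhere = mymodule:extract_from_anything
"""


def registered_names(text):
    """Parse the [wex] section into a {name: 'module:function'} dict."""
    # Only '=' is treated as a delimiter so that the ':' in the value
    # is left intact.
    parser = configparser.ConfigParser(delimiters=("=",))
    parser.read_string(text)
    return dict(parser["wex"])


def applies_to(name, hostname):
    """Illustrate the rule: a leading '.' restricts by domain name."""
    if not name.startswith("."):
        return True  # applied to responses whatever their domain
    return hostname == name[1:] or hostname.endswith(name)


names = registered_names(ENTRY_POINTS)
```

With these definitions, `applies_to(".example.net", "www.example.net")` and `applies_to(".example.net", "example.net")` are both true, while `applies_to(".example.net", "example.org")` is false.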
Extractor¶
An extractor is a callable that returns or yields data. For example:
def extract(response):
    return "something"
The response parameter here is an instance of
wex.response.Response.
Extractors can be combined in various ways.
-
class
wex.extractor.Named(**kw)[source]¶ An extractor that is a collection of named extractors.
Extractors can be added to the collection on construction using keyword arguments for the names, or they can be added using add().
The names are labels in the output produced. For example, an extractor function extract defined as follows:
extract = Named(
    name1 = (lambda response: "one"),
    name2 = (lambda response: "two"),
)
Would produce extraction output something like this:
$ wex http://example.net/
"name1" "one"
"name2" "two"
The ordering of sub-extractor output is arbitrary.
-
add(extractor, label=None)[source]¶ Add an attribute extractor.
Parameters:
- extractor (callable) – The extractor to be added.
- label (str) – The label for the extractor. This may be None, in which case the extractor's __name__ attribute will be used.
This method returns the extractor added. This means it can also be used as a decorator. For example:
attrs = Named()

@attrs.add
def attr1(response):
    return "one"
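Why returning the extractor makes add() usable as a decorator can be seen in a small stand-in. `NamedSketch` below is a hypothetical class written for illustration; it is not the implementation of wex.extractor.Named and it models output as (label, value) tuples, which is an assumption.

```python
class NamedSketch:
    """Hypothetical stand-in for a collection of named extractors."""

    def __init__(self, **kw):
        self.extractors = {}
        for label, extractor in kw.items():
            self.add(extractor, label)

    def add(self, extractor, label=None):
        # Fall back to the function name when no label is given.
        if label is None:
            label = extractor.__name__
        self.extractors[label] = extractor
        # Returning the extractor is what allows decorator use: the
        # decorated name stays bound to the original function.
        return extractor

    def __call__(self, response):
        for label, extractor in self.extractors.items():
            yield (label, extractor(response))


attrs = NamedSketch(name1=lambda response: "one")


@attrs.add
def attr1(response):
    return "one"


rows = sorted(attrs(None))  # the sketch ignores the response argument
```

After decoration `attr1` is still the plain function, and the collection holds it under the label "attr1".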
-
-
wex.extractor.chained(*extractors)[source]¶ Returns an extractor that chains the output of other extractors.
The output is the output from each extractor in sequence.
Parameters: extractors – an iterable of extractor callables to chain.
For example, an extractor function extract defined as follows:
def extract1(response):
    yield "one"

def extract2(response):
    yield "two"

extract = chained(extract1, extract2)
Would produce the following extraction output:
$ wex http://example.net/
"one"
"two"
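The chaining behaviour can be approximated with itertools. `chained_sketch` is a hypothetical helper, not the library's code, and it assumes every sub-extractor yields its values (an extractor that returned a bare string would be iterated character by character here).

```python
from itertools import chain


def chained_sketch(*extractors):
    """Return an extractor yielding each sub-extractor's output in order."""
    def extract(response):
        return chain.from_iterable(e(response) for e in extractors)
    return extract


def extract1(response):
    yield "one"


def extract2(response):
    yield "two"


extract = chained_sketch(extract1, extract2)
values = list(extract(None))  # the sketch ignores the response argument
```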
-
wex.extractor.labelled(*args)[source]¶ Returns an extractor decorator that will label the output of an extractor.
Parameters: literals_or_callables – An iterable of labels or callables.
Each item in literals_or_callables may be a literal or a callable. Any callable will be called with the same parameters as the extractor and whatever is returned will be used as a label.
For example, an extractor function extract defined as follows:
def extract1(response):
    yield "one"

def label2(response):
    return "label2"

extract = labelled("label1", label2)(extract1)
Would produce the following extraction output:
$ wex http://example.net/
"label1" "label2" "one"
Note that if any of the labels are false then no output will be generated from that extractor.
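The labelling rules, including the suppression of output when a label is false, can be sketched as follows. `labelled_sketch` is hypothetical and models each output line as a tuple of labels followed by the value; the real decorator may differ in detail.

```python
def labelled_sketch(*literals_or_callables):
    """Return a decorator that prefixes extractor output with labels."""
    def decorator(extractor):
        def extract(response):
            # Callables are resolved with the extractor's own argument.
            labels = tuple(
                item(response) if callable(item) else item
                for item in literals_or_callables
            )
            if not all(labels):
                return  # a false label suppresses all output
            for value in extractor(response):
                yield labels + (value,)
        return extract
    return decorator


def extract1(response):
    yield "one"


def label2(response):
    return "label2"


extract = labelled_sketch("label1", label2)(extract1)
rows = list(extract(None))
silent = list(labelled_sketch("label1", None)(extract1)(None))
```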
Element Tree¶
Composable functions for extracting data using lxml.
-
wex.etree.base_url_pair_getter(get_url)[source]¶ Returns a function for getting a tuple of (base_url, url) when called with an etree Element or ElementTree.
In the returned pair, base_url is the value returned from get_base_url() on the etree Element or ElementTree. The second value is the value returned by calling get_url on the same etree Element or ElementTree, joined to the base_url using urljoin. This allows get_url to return a relative URL.
-
wex.etree.css(expression)[source]¶ Returns a composable callable that will select elements defined by a CSS selector expression.
Parameters: expression – The CSS selector expression.
The callable returned accepts a wex.response.Response, a list of elements or an individual element as an argument.
-
wex.etree.drop_tree(*selectors)[source]¶ Return a function that will remove trees selected by selectors.
-
wex.etree.href_any_url¶ A wex.composed.ComposedFunction that returns the absolute URL from an href attribute.
-
wex.etree.href_url¶ A wex.composed.ComposedFunction that returns the absolute URL from an href attribute as long as it is from the same domain as the base URL of the response.
-
wex.etree.href_url_same_suffix¶ A wex.composed.ComposedFunction that returns the absolute URL from an href attribute as long as it is from the same public suffix as the base URL of the response.
-
wex.etree.same_suffix(url_pair)[source]¶ Return second url of pair if both have the same public suffix.
-
wex.etree.src_url¶ A wex.composed.ComposedFunction that returns the absolute URL from a src attribute.
-
wex.etree.text¶ Alias for normalize-space | list2set
-
wex.etree.text_content¶ Return text content from an object (typically a node-set), excluding content from within <script> or <style> elements.
-
wex.etree.xpath(expression, namespaces={u're': u'http://exslt.org/regular-expressions'})[source]¶ Returns a composable callable that will select elements defined by an XPath expression.
Parameters:
- expression – The XPath expression.
- namespaces – The namespaces.
The callable returned accepts a wex.response.Response, a list of elements or an individual element as an argument.
For example:
>>> from lxml.html import fromstring
>>> tree = fromstring('<h1>Hello</h1>')
>>> selector = xpath('//h1')
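The idea of a selector factory that accepts either one element or a list of elements can be shown with the standard library's xml.etree.ElementTree in place of lxml. This is a swapped-in technique: ElementTree supports only a subset of XPath, and `select` is a hypothetical helper, not part of wex.etree.

```python
import xml.etree.ElementTree as ET


def select(path):
    """Return a callable that finds `path` under an element or a list."""
    def selector(target):
        elements = target if isinstance(target, list) else [target]
        found = []
        for element in elements:
            found.extend(element.findall(path))
        return found
    return selector


root = ET.fromstring(
    "<html><body><h1>Hello</h1><h1>World</h1></body></html>"
)
headings = select(".//h1")
texts = [h.text for h in headings(root)]
texts_from_list = [h.text for h in headings([root])]
```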
Regular Expressions¶
-
wex.regex.re_group(pattern, group=1, flags=0)[source]¶ Returns a composable callable that extracts the specified group using a regular expression.
Parameters:
- pattern – The regular expression.
- group – The group from the MatchObject.
- flags – Flags to use when compiling the pattern.
-
wex.regex.re_groupdict(pattern, flags=0)[source]¶ Returns a composable callable that extracts a group dictionary using a regular expression.
Parameters:
- pattern – The regular expression.
- flags – Flags to use when compiling the pattern.
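A rough equivalent of these two factories can be written with the standard re module. The `_sketch` functions are hypothetical, and yielding once per match via finditer is an assumption; the library may instead use only the first match.

```python
import re


def re_group_sketch(pattern, group=1, flags=0):
    """Return a callable yielding the chosen group for every match."""
    compiled = re.compile(pattern, flags)

    def extract(text):
        for match in compiled.finditer(text):
            yield match.group(group)
    return extract


def re_groupdict_sketch(pattern, flags=0):
    """Return a callable yielding the named groups of every match."""
    compiled = re.compile(pattern, flags)

    def extract(text):
        for match in compiled.finditer(text):
            yield match.groupdict()
    return extract


numbers = list(re_group_sketch(r"(\d+)")("a1 b22"))
pairs = list(re_groupdict_sketch(r"(?P<key>\w+)=(?P<value>\w+)")("a=1 b=2"))
```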
String Functions¶
Iterables¶
Helper functions for things that are iterable
-
exception
wex.iterable.MultipleValuesError[source]¶ More than one value was found when one or none were expected.
-
exception
wex.iterable.ZeroValuesError[source]¶ Zero values were found when at least one was expected.
-
wex.iterable.islice(*islice_args)[source]¶ Returns a function that will perform itertools.islice on its input.
-
wex.iterable.one(iterable)[source]¶ Returns an item from an iterable of exactly one element.
If the iterable comprises zero elements then ZeroValuesError is raised. If the iterable has more than one element then MultipleValuesError is raised.
-
wex.iterable.one_or_none(iterable)[source]¶ Returns one item or None from an iterable of length one or zero.
If the iterable is empty then None is returned.
If the iterable has more than one element then MultipleValuesError is raised.
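The documented behaviour of these helpers can be reproduced in a few lines. The classes and functions below are local stand-ins that reuse the documented names for readability; deriving the exceptions from ValueError is an assumption.

```python
from itertools import islice


class ZeroValuesError(ValueError):
    """Stand-in: zero values found when at least one was expected."""


class MultipleValuesError(ValueError):
    """Stand-in: more than one value found."""


def one(iterable):
    # Reading at most two items is enough to tell the three cases apart.
    items = list(islice(iterable, 2))
    if not items:
        raise ZeroValuesError()
    if len(items) > 1:
        raise MultipleValuesError()
    return items[0]


def one_or_none(iterable):
    items = list(islice(iterable, 2))
    if len(items) > 1:
        raise MultipleValuesError()
    return items[0] if items else None


def islice_sketch(*islice_args):
    """Return a function applying itertools.islice to its input."""
    return lambda iterable: islice(iterable, *islice_args)


first_two = list(islice_sketch(2)("abcd"))
```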
URLs¶
Other Methods¶
-
wex.form.form_values(self)[source]¶ Return a list of tuples of the field values for the form. This is suitable to be passed to
urllib.urlencode().
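A list of tuples in this shape encodes directly into a query string or form body. The field names and values below are made up for illustration, and on Python 3 the function lives at urllib.parse.urlencode rather than urllib.urlencode.

```python
from urllib.parse import urlencode

# Hypothetical field values of the kind form_values() is described
# as returning: a list of (name, value) tuples.
values = [("q", "web data"), ("page", "2")]

encoded = urlencode(values)
```

Here `encoded` is the string `q=web+data&page=2`, with the pairs kept in list order.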
-
class
wex.ftp.RETRReadable(ftp, basename)[source]¶ Just like ftplib.FTP.retrbinary, but implements read and readline.
-
wex.ftp.close_on_empty(unbound)[source]¶ Calls ‘close’ on the first argument when the method returns something falsey.
The first argument is presumed to be self.
-
wex.ftp.format_header()¶ S.format(*args, **kwargs) -> unicode
Return a formatted version of S, using substitutions from args and kwargs. The substitutions are identified by braces (‘{‘ and ‘}’).
-
wex.ftp.format_status_line()¶ S.format(*args, **kwargs) -> unicode
Return a formatted version of S, using substitutions from args and kwargs. The substitutions are identified by braces (‘{‘ and ‘}’).
Functions for getting responses for HTTP URLs.
-
wex.http.format_header()¶ S.format(*args, **kwargs) -> unicode
Return a formatted version of S, using substitutions from args and kwargs. The substitutions are identified by braces (‘{‘ and ‘}’).
-
wex.http.format_status_line()¶ S.format(*args, **kwargs) -> unicode
Return a formatted version of S, using substitutions from args and kwargs. The substitutions are identified by braces (‘{‘ and ‘}’).
Sitemaps¶
Extractors for URLs from /robots.txt and sitemaps.
-
wex.sitemaps.urls_from_sitemaps = Chained([urls_from_robots_txt, urls_from_urlset_or_sitemapindex])¶ Extractor that combines urls_from_robots_txt() and urls_from_urlset_or_sitemapindex().
-
wex.sitemaps.urls_from_urlset_or_sitemapindex(response)[source]¶ Yields URLs from <urlset> or <sitemapindex> elements as per sitemaps.org.
Response¶
-
class
wex.response.Response(content, headers, url, code=None, **kw)[source]¶ A urllib2 style Response with some extras.
Parameters: - content – A file-like object containing the response content.
- headers – An HTTPMessage containing the response headers.
- url – The URL for which this is the response.
- code – The status code received with this response.
- protocol – The protocol received with this response.
- version – The protocol version received with this response.
- reason – The reason received with this response.
- request_url – The URL requested that led to this response.
Composed¶
Wextracto uses Function composition as an easy way to build new functions from existing ones:
>>> from wex.composed import compose
>>> def add1(x):
... return x + 1
...
>>> def mult2(x):
... return x * 2
...
>>> f = compose(add1, mult2)
>>> f(2)
6
Wextracto uses the pipe operator, |, as a shorthand for function composition.
This shorthand can be a powerful technique for reducing boilerplate code when
used in combination with named() extractors:
from wex.etree import css, text
from wex.extractor import named
attrs = named(title = css('h1') | text,
              description = css('#description') | text)
-
class
wex.composed.ComposedCallable(*functions)[source]¶ A callable, taking one argument, composed from other callables.
def mult2(x):
    return x * 2

def add1(x):
    return x + 1

composed = ComposedCallable(add1, mult2)
for x in (1, 2, 3):
    assert composed(x) == mult2(add1(x))
ComposedCallable objects are composable. They can be composed of other ComposedCallable objects.
-
wex.composed.composable(func)[source]¶ Decorates a callable to support function composition using |.
For example:
@composable
def add1(x):
    return x + 1

def mult2(x):
    return x * 2

composed = add1 | mult2
-
wex.composed.compose(*functions)[source]¶ Create a ComposedCallable from zero or more functions.
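One way the | shorthand can be made to work in Python is by defining `__or__` and `__ror__` on a wrapper object. `PipeSketch` is a hypothetical class written to show the mechanism; it is not the source of wex.composed.

```python
class PipeSketch:
    """Hypothetical wrapper applying its functions left to right."""

    def __init__(self, *functions):
        self.functions = functions

    def __call__(self, arg):
        for function in self.functions:
            arg = function(arg)
        return arg

    def __or__(self, other):
        # self | other: run self first, then other.
        return PipeSketch(self, other)

    def __ror__(self, other):
        # other | self, where `other` is a plain function without __or__.
        return PipeSketch(other, self)


def mult2(x):
    return x * 2


add1 = PipeSketch(lambda x: x + 1)

add_then_double = add1 | mult2
double_then_add = mult2 | add1
```

With these definitions `add_then_double(2)` evaluates `mult2(add1(2))`, and `double_then_add(2)` evaluates `add1(mult2(2))`.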
Output¶
Extracted data values are represented with tab-separated fields. The right-most field on each line is the value, all preceding fields are labels that describe the value. The labels and the value are all JSON encoded.
So for example, a value 9.99 with the labels product and price would
look like:
"product" "price" 9.99
And we could decode this line with the following Python snippet:
>>> import json
>>> line = '"product"\t"price"\t9.99\n'
>>> [json.loads(s) for s in line.split('\t')]
[u'product', u'price', 9.99]
Using tab-delimiters is convenient for downstream processing using Unix command line tools such as cut and grep.
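Going the other way, a line in this format can be produced by JSON-encoding each field and joining the results with tabs. `encode_line` and `decode_line` are hypothetical helper names used only for this sketch.

```python
import json


def encode_line(labels, value):
    """JSON-encode the labels and the value, separated by tabs."""
    fields = list(labels) + [value]
    return "\t".join(json.dumps(field) for field in fields) + "\n"


def decode_line(line):
    """Inverse of encode_line: split on tabs and JSON-decode each field."""
    return [json.loads(field) for field in line.split("\t")]


line = encode_line(["product", "price"], 9.99)
fields = decode_line(line)
```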
URL Labelling¶
The convention for Wextracto is that any URL that should be downloaded
has the left-most label url. For example:
"url" "http://example.net/some/url"
Data Labelling¶
If you are extracting multiple types of data (for example people and addresses) then a good labelling scheme is important.
It is a good idea to label the extracted values so that you can sort them easily using the Unix sort command.
An example of a labelling scheme that allows this would be:
{type} {identifier} {attribute} {value}
So we might end up with output that looks like this:
"person" "http://example.net/person/1" "name" "Tom Bombadil"
"person" "http://example.net/person/1" "email" "tom1@example.net"
"address" "http://example.net/address/2" "city" "New York"
"address" "http://example.net/address/2" "postal code" "10001"
"person" "http://example.net/person/3" "name" "Jack Sprat"
"person" "http://example.net/person/3" "email" "jack3@example.net"
"address" "http://example.net/address/4" "city" "London"
"address" "http://example.net/address/4" "postal code" "E14 5AB"
With output like this we can easily sort and group it.
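The same sorting and grouping can be done in Python once the lines are decoded. The sketch below uses four of the sample lines above, assumes the fields are tab-separated, and builds one dictionary of attributes per (type, identifier) pair; the variable names are illustrative.

```python
import json
from itertools import groupby

lines = [
    '"person"\t"http://example.net/person/1"\t"name"\t"Tom Bombadil"',
    '"address"\t"http://example.net/address/2"\t"city"\t"New York"',
    '"person"\t"http://example.net/person/1"\t"email"\t"tom1@example.net"',
    '"address"\t"http://example.net/address/2"\t"postal code"\t"10001"',
]

# Decode every field, then sort so that rows sharing a type and
# identifier become adjacent (groupby only merges adjacent rows).
rows = sorted([json.loads(f) for f in line.split("\t")] for line in lines)

records = {}
for key, group in groupby(rows, key=lambda row: (row[0], row[1])):
    records[key] = {attribute: value for _, _, attribute, value in group}
```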
Regression Tests¶
When maintaining extractors it can be helpful to have some sample input and output so that regression testing can be performed when we need to change the extractors.
Wextracto supports this with the --save and --save-dir options to
the wex command. These options save both the input and
output to a local directory.
This input and output can then be used for comparison with the current extractor output.
To compare current output against saved output, run py.test like so:
$ py.test saved/