Reference¶
This is the reference for the Wextracto web data extraction package.
Command¶
The wex command extracts data from HTTP-like responses.
These responses can come from files, directories or URLs specified on the
command line. The command calls any extractors that have been registered
and writes any data extracted as output.
The output and input can be saved using the --save or --save-dir
command line arguments. This is useful for regression testing
existing extractor functions.
The tests are run using py.test.
For the complete list of command line arguments run:
$ wex --help
Registering Extractors¶
The simplest way to register extractors is to have a file named
entry_points.txt in the current directory. This file should
look something like this:
[wex]
.example.net = mymodule:extract_from_example_net
The [wex] section heading tells Wextracto that the following lines register extractors.
Extractors are registered using name = value pairs.
If the name starts with . then the extractor is only applied to responses from domain names that match that name.
Our example would match responses from www.example.net or example.net.
If the name does not start with . it will be applied to responses whatever their domain.
You can register the same extractor against multiple domain names by having multiple lines with the same value but different names.
This is exactly the same format and content that you would use in the
entry_points parameter for a setup function,
if and when you want to package your extractor functions.
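As a rough illustration of the registration rules above, the sketch below parses the example entry_points.txt content with Python's standard configparser module and applies the leading-dot domain rule. It is not Wextracto's own loading code: the helper names load_registrations and name_matches, and the exact matching logic, are assumptions made for illustration.

```python
import configparser

ENTRY_POINTS = """
[wex]
.example.net = mymodule:extract_from_example_net
"""

def load_registrations(text):
    """Return {name: value} pairs from the [wex] section."""
    # Only '=' separates name from value, so ':' may appear in the value.
    parser = configparser.ConfigParser(delimiters=("=",))
    parser.read_string(text)
    return dict(parser["wex"])

def name_matches(name, hostname):
    """Hypothetical matching rule: a leading '.' restricts by domain."""
    if not name.startswith("."):
        return True  # applies to responses from any domain
    # '.example.net' matches 'example.net' and any subdomain of it
    return ("." + hostname).endswith(name)

registrations = load_registrations(ENTRY_POINTS)
```

Under this assumed rule, .example.net matches both example.net and www.example.net but not an unrelated host such as badexample.net.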
Extractor¶
An extractor is a callable that returns or yields data. For example:
def extract(response):
    return "something"
The response parameter here is an instance of wex.response.Response.
Extractors can be combined in various ways.
-
class wex.extractor.Named(**kw)[source]¶
An extractor that is a collection of named extractors.
Extractors can be added to the collection on construction using keyword arguments for the names or they can be added using add().
The names are labels in the output produced. For example, an extractor function extract defined as follows:
extract = Named(
    name1 = (lambda response: "one"),
    name2 = (lambda response: "two"),
)
Would produce extraction output something like this:
$ wex http://example.net/
"name1" "one"
"name2" "two"
The ordering of sub-extractor output is arbitrary.
-
add(extractor, label=None)[source]¶
Add an attribute extractor.
Parameters:
- extractor (callable) – The extractor to be added.
- label (str) – The label for the extractor. This may be None, in which case the extractor's __name__ attribute will be used.
This method returns the extractor added. This means it can also be used as a decorator. For example:
attrs = Named()

@attrs.add
def attr1(response):
    return "one"
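To make the labelling behaviour concrete, here is a minimal stand-in for a named collection written in plain Python. It is a sketch under assumptions, not the wex.extractor.Named implementation: the class name NamedSketch is hypothetical, output is modelled as (label, value) tuples, and the response is just a placeholder object.

```python
class NamedSketch(object):
    """Hypothetical stand-in for a collection of named extractors."""

    def __init__(self, **kw):
        self.extractors = {}
        for label, extractor in kw.items():
            self.add(extractor, label)

    def add(self, extractor, label=None):
        # Fall back to the function name when no label is given.
        if label is None:
            label = extractor.__name__
        self.extractors[label] = extractor
        return extractor  # returning it lets add() work as a decorator

    def __call__(self, response):
        # Prefix each sub-extractor's value with its name.
        for label, extractor in self.extractors.items():
            yield (label, extractor(response))

attrs = NamedSketch(name1=lambda response: "one")

@attrs.add
def attr1(response):
    return "one"

# Sorted because the ordering of sub-extractor output is not guaranteed.
output = sorted(attrs(None))
```

Because add() returns the extractor it was given, the decorated attr1 remains an ordinary function that can still be called directly.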
-
-
wex.extractor.chained(*extractors)[source]¶
Returns an extractor that chains the output of other extractors.
The output is the output from each extractor in sequence.
Parameters: extractors – an iterable of extractor callables to chain
For example, an extractor function extract defined as follows:
def extract1(response):
    yield "one"

def extract2(response):
    yield "two"

extract = chained(extract1, extract2)
Would produce the following extraction output:
$ wex http://example.net/
"one"
"two"
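The chaining behaviour can be approximated with itertools.chain from the standard library. The function chained_sketch below is a hypothetical equivalent written for illustration, not the library's code, and it is called with a placeholder response of None.

```python
from itertools import chain

def chained_sketch(*extractors):
    """Hypothetical equivalent: run each extractor in turn on the response."""
    def extract(response):
        # Output of the first extractor, then the second, and so on.
        return chain.from_iterable(e(response) for e in extractors)
    return extract

def extract1(response):
    yield "one"

def extract2(response):
    yield "two"

extract = chained_sketch(extract1, extract2)
values = list(extract(None))
```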
-
wex.extractor.labelled(*args)[source]¶
Returns an extractor decorator that will label the output of an extractor.
Parameters: literals_or_callables – An iterable of labels or callables.
Each item in literals_or_callables may be a literal or a callable. Any callable will be called with the same parameters as the extractor and whatever is returned will be used as a label.
For example, an extractor function extract defined as follows:
def extract1(response):
    yield "one"

def label2(response):
    return "label2"

extract = labelled("label1", label2)(extract1)
Would produce the following extraction output:
$ wex http://example.net/
"label1" "label2" "one"
Note that if any of the labels are false then no output will be generated from that extractor.
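The following sketch shows one way such a labelling decorator could behave, including the rule that a false label suppresses output. The name labelled_sketch is hypothetical and the code is an assumption-based illustration rather than the library's implementation; output rows are modelled as tuples of labels followed by the value.

```python
def labelled_sketch(*literals_or_callables):
    """Hypothetical decorator that prefixes extractor output with labels."""
    def decorator(extractor):
        def wrapper(response):
            # Callables are resolved against the response; literals are used as-is.
            labels = [
                item(response) if callable(item) else item
                for item in literals_or_callables
            ]
            if not all(labels):
                return  # a false label suppresses all output
            for value in extractor(response):
                yield tuple(labels) + (value,)
        return wrapper
    return decorator

def extract1(response):
    yield "one"

def label2(response):
    return "label2"

extract = labelled_sketch("label1", label2)(extract1)
rows = list(extract(None))
silent = list(labelled_sketch("label1", None)(extract1)(None))
```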
Element Tree¶
Composable functions for extracting data using lxml.
-
wex.etree.base_url_pair_getter(get_url)[source]¶
Returns a function for getting a tuple of (base_url, url) when called with an etree Element or ElementTree.
In the returned pair, base_url is the value returned from get_base_url on the etree Element or ElementTree. The second value is the value returned by calling get_url on the same etree Element or ElementTree, joined to the base_url using urljoin. This allows get_url to return a relative URL.
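The joining step can be seen in isolation with the standard library's urljoin. The helper url_pair and the example URLs below are hypothetical; the sketch only demonstrates how a relative URL is resolved against a base URL, not how Wextracto obtains either value from an element.

```python
from urllib.parse import urljoin

def url_pair(base_url, url):
    """Return (base_url, absolute_url), resolving url against base_url."""
    return (base_url, urljoin(base_url, url))

pair = url_pair("http://example.net/articles/index.html", "../about.html")
```

An absolute URL passed as the second argument is returned unchanged by urljoin, so the same code path handles both relative and absolute links.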
-
wex.etree.css(expression)[source]¶
Returns a composable callable that will select elements defined by a CSS selector expression.
Parameters: expression – The CSS selector expression.
The callable returned accepts a wex.response.Response, a list of elements or an individual element as an argument.
-
wex.etree.drop_tree(*selectors)[source]¶
Return a function that will remove trees selected by selectors.
-
wex.etree.href_any_url¶
A wex.composed.ComposedFunction that returns the absolute URL from an href attribute.
-
wex.etree.href_url¶
A wex.composed.ComposedFunction that returns the absolute URL from an href attribute as long as it is from the same domain as the base URL of the response.
-
wex.etree.href_url_same_suffix¶
A wex.composed.ComposedFunction that returns the absolute URL from an href attribute as long as it is from the same public suffix as the base URL of the response.
-
wex.etree.same_suffix(url_pair)[source]¶
Return the second URL of the pair if both have the same public suffix.
-
wex.etree.src_url¶
A wex.composed.ComposedFunction that returns the absolute URL from a src attribute.
-
wex.etree.text¶
Alias for normalize-space | list2set
-
wex.etree.text_content¶
Return text content from an object (typically a node-set), excluding content from within <script> or <style> elements.
-
wex.etree.xpath(expression, namespaces={u're': u'http://exslt.org/regular-expressions'})[source]¶
Returns a composable callable that will select elements defined by an XPath expression.
Parameters:
- expression – The XPath expression.
- namespaces – The namespaces.
The callable returned accepts a wex.response.Response, a list of elements or an individual element as an argument.
For example:
>>> from lxml.html import fromstring
>>> tree = fromstring('<h1>Hello</h1>')
>>> selector = xpath('//h1')
Regular Expressions¶
-
wex.regex.re_group(pattern, group=1, flags=0)[source]¶
Returns a composable callable that extracts the specified group using a regular expression.
Parameters:
- pattern – The regular expression.
- group – The group from the MatchObject.
- flags – Flags to use when compiling the pattern.
-
wex.regex.re_groupdict(pattern, flags=0)[source]¶
Returns a composable callable that extracts a group dictionary using a regular expression.
Parameters:
- pattern – The regular expression.
- flags – Flags to use when compiling the pattern.
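The two regular expression helpers can be pictured with the standard re module. The functions below are hypothetical sketches, not the wex.regex code. In particular, this reference does not say whether the real callables return the first match or every match; the sketch yields a result for every match as an assumption, and it takes a plain string as input.

```python
import re

def re_group_sketch(pattern, group=1, flags=0):
    """Hypothetical version: yield the chosen group for each match."""
    compiled = re.compile(pattern, flags)
    def extract(text):
        for match in compiled.finditer(text):
            yield match.group(group)
    return extract

def re_groupdict_sketch(pattern, flags=0):
    """Hypothetical version: yield the named groups for each match."""
    compiled = re.compile(pattern, flags)
    def extract(text):
        for match in compiled.finditer(text):
            yield match.groupdict()
    return extract

prices = list(re_group_sketch(r"\$(\d+\.\d\d)")("Was $12.50, now $9.99"))
parts = list(re_groupdict_sketch(r"(?P<key>\w+)=(?P<value>\w+)")("a=1 b=2"))
```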
String Functions¶
Iterables¶
Helper functions for things that are iterable
-
exception wex.iterable.MultipleValuesError[source]¶
More than one value was found when one or none were expected.
-
exception wex.iterable.ZeroValuesError[source]¶
Zero values were found when at least one was expected.
-
wex.iterable.islice(*islice_args)[source]¶
Returns a function that will perform itertools.islice on its input.
-
wex.iterable.one(iterable)[source]¶
Returns an item from an iterable of exactly one element.
If the iterable comprises zero elements then ZeroValuesError is raised. If the iterable has more than one element then MultipleValuesError is raised.
-
wex.iterable.one_or_none(iterable)[source]¶
Returns one item or None from an iterable of length one or zero.
If the iterable is empty then None is returned.
If the iterable has more than one element then MultipleValuesError is raised.
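The contract of one and one_or_none can be sketched in a few lines of plain Python. The names below carry a Sketch suffix to make clear they are hypothetical stand-ins, including the exception classes, and are not the wex.iterable implementations.

```python
class ZeroValuesSketchError(ValueError):
    """Raised when no values are found but one was expected."""

class MultipleValuesSketchError(ValueError):
    """Raised when more than one value is found."""

_MISSING = object()  # sentinel, so that None can be a legitimate value

def one_or_none_sketch(iterable):
    iterator = iter(iterable)
    first = next(iterator, _MISSING)
    if first is _MISSING:
        return None
    if next(iterator, _MISSING) is not _MISSING:
        raise MultipleValuesSketchError("more than one value")
    return first

def one_sketch(iterable):
    iterator = iter(iterable)
    first = next(iterator, _MISSING)
    if first is _MISSING:
        raise ZeroValuesSketchError("no values")
    if next(iterator, _MISSING) is not _MISSING:
        raise MultipleValuesSketchError("more than one value")
    return first
```

Both functions consume at most two items, so they also work on generators and other one-shot iterables.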
URLs¶
Other Methods¶
-
wex.form.form_values(self)[source]¶
Return a list of tuples of the field values for the form. This is suitable to be passed to urllib.urlencode().
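For reference, this is how a list of tuples of that shape is encoded. The field names and values here are made up for illustration. Note that urllib.urlencode is the Python 2 location of the function; in Python 3 it lives at urllib.parse.urlencode, which is what the sketch uses.

```python
from urllib.parse import urlencode

# Hypothetical field values, in the list-of-tuples shape described above.
values = [("q", "tom bombadil"), ("page", "2"), ("tag", "a"), ("tag", "b")]

# A list of tuples preserves field order and allows repeated field names.
query = urlencode(values)
```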
-
class wex.ftp.RETRReadable(ftp, basename)[source]¶
Just like ftplib.FTP.retrbinary, but implements read and readline.
-
wex.ftp.close_on_empty(unbound)[source]¶
Calls ‘close’ on the first argument when the method returns something falsey.
The first argument is presumed to be self.
-
wex.ftp.format_header()¶
S.format(*args, **kwargs) -> unicode
Return a formatted version of S, using substitutions from args and kwargs. The substitutions are identified by braces (‘{‘ and ‘}’).
-
wex.ftp.format_status_line()¶
S.format(*args, **kwargs) -> unicode
Return a formatted version of S, using substitutions from args and kwargs. The substitutions are identified by braces (‘{‘ and ‘}’).
Functions for getting responses for HTTP urls.
-
wex.http.format_header()¶
S.format(*args, **kwargs) -> unicode
Return a formatted version of S, using substitutions from args and kwargs. The substitutions are identified by braces (‘{‘ and ‘}’).
-
wex.http.format_status_line()¶
S.format(*args, **kwargs) -> unicode
Return a formatted version of S, using substitutions from args and kwargs. The substitutions are identified by braces (‘{‘ and ‘}’).
Sitemaps¶
Extractors for URLs from /robots.txt and sitemaps.
-
wex.sitemaps.urls_from_sitemaps = Chained([<function urls_from_robots_txt at 0x7ffbd3f547d0>, <function urls_from_urlset_or_sitemapindex at 0x7ffbd3f54b18>])¶
Extractor that combines urls_from_robots_txt() and urls_from_urlset_or_sitemapindex().
-
wex.sitemaps.urls_from_urlset_or_sitemapindex(response)[source]¶
Yields URLs from <urlset> or <sitemapindex> elements as per sitemaps.org.
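To show what pulling URLs out of a sitemap involves, the sketch below reads <loc> values from a small <urlset> document using the standard library's ElementTree. The function locs_from_sitemap and the sample document are hypothetical, and the code is not how wex.sitemaps is implemented; it works on an XML string rather than a Response.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def locs_from_sitemap(xml_text):
    """Yield <loc> URLs from a <urlset> or <sitemapindex> document."""
    root = ET.fromstring(xml_text)
    for child in root:  # <url> or <sitemap> entries
        loc = child.find(SITEMAP_NS + "loc")
        if loc is not None and loc.text:
            yield loc.text.strip()

URLSET = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.net/a</loc></url>
  <url><loc>http://example.net/b</loc></url>
</urlset>"""

urls = list(locs_from_sitemap(URLSET))
```

The same loop handles a <sitemapindex>, because its <sitemap> children also carry a <loc> element in the same namespace.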
Response¶
-
class wex.response.Response(content, headers, url, code=None, **kw)[source]¶
A urllib2 style Response with some extras.
Parameters:
- content – A file-like object containing the response content.
- headers – An HTTPMessage containing the response headers.
- url – The URL for which this is the response.
- code – The status code received with this response.
- protocol – The protocol received with this response.
- version – The protocol version received with this response.
- reason – The reason received with this response.
- request_url – The URL requested that led to this response.
Composed¶
Wextracto uses Function composition as an easy way to build new functions from existing ones:
>>> from wex.composed import compose
>>> def add1(x):
... return x + 1
...
>>> def mult2(x):
... return x * 2
...
>>> f = compose(add1, mult2)
>>> f(2)
6
Wextracto uses the pipe operator, |, as a shorthand for function composition.
This shorthand can be a powerful technique for reducing boilerplate code when
used in combination with named() extractors:
from wex.etree import css, text
from wex.extractor import named
attrs = named(title = css('h1') | text,
              description = css('#description') | text)
-
class wex.composed.ComposedCallable(*functions)[source]¶
A callable, taking one argument, composed from other callables.
def mult2(x):
    return x * 2

def add1(x):
    return x + 1

composed = ComposedCallable(add1, mult2)

for x in (1, 2, 3):
    assert composed(x) == mult2(add1(x))
ComposedCallable objects are composable. They can be composed of other ComposedCallable objects.
-
wex.composed.composable(func)[source]¶
Decorates a callable to support function composition using |.
For example:
@composable
def add1(x):
    return x + 1

def mult2(x):
    return x * 2

composed = add1 | mult2
-
wex.composed.compose(*functions)[source]¶
Create a ComposedCallable from zero or more functions.
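The pipe shorthand relies on Python's __or__ operator hook. The class below is a minimal sketch of the idea under that assumption; ComposableSketch is a hypothetical name and this is not the wex.composed implementation.

```python
class ComposableSketch(object):
    """Hypothetical wrapper: calls functions left to right, supports |."""

    def __init__(self, *functions):
        self.functions = functions

    def __call__(self, arg):
        # Feed the result of each function into the next one.
        for function in self.functions:
            arg = function(arg)
        return arg

    def __or__(self, other):
        # self | other: apply self first, then other
        return ComposableSketch(self, other)

def add1(x):
    return x + 1

def mult2(x):
    return x * 2

composed = ComposableSketch(add1) | mult2
```

Only the left-hand operand needs to be wrapped; the result of | is itself a ComposableSketch, so further functions can be appended with more pipes.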
Output¶
Extracted data values are represented with tab-separated fields. The right-most field on each line is the value, all preceding fields are labels that describe the value. The labels and the value are all JSON encoded.
So for example, a value 9.99 with the labels product and price would
look like:
"product" "price" 9.99
And we could decode this line with the following Python snippet:
>>> import json
>>> line = '"product"\t"price"\t9.99\n'
>>> [json.loads(s) for s in line.split('\t')]
[u'product', u'price', 9.99]
Using tab-delimiters is convenient for downstream processing using Unix command line tools such as cut and grep.
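Going the other way, a line in this format can be produced with json.dumps. The helper names encode_line and decode_line are hypothetical; the sketch simply applies the format described above and is not taken from Wextracto's writer code.

```python
import json

def encode_line(labels, value):
    """Join JSON-encoded labels and value with tabs, ending in a newline."""
    fields = [json.dumps(field) for field in tuple(labels) + (value,)]
    return "\t".join(fields) + "\n"

def decode_line(line):
    """Split a line back into (labels, value)."""
    fields = [json.loads(field) for field in line.split("\t")]
    return fields[:-1], fields[-1]

line = encode_line(("product", "price"), 9.99)
```

JSON encoding escapes any tab character inside a string as \t, so a raw tab in the output always marks a field boundary.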
URL Labelling¶
The convention for Wextracto is that any URL that should be downloaded
has the left-most label url. For example:
"url" "http://example.net/some/url"
Data Labelling¶
If you are extracting multiple types of data (for example people and addresses) then a good labelling scheme is important.
It is a good idea to label the extracted values so that you can sort them easily using the Unix sort command.
An example of a labelling scheme that allows this would be:
{type} {identifier} {attribute} {value}
So we might end up with output that looks like this:
"person" "http://example.net/person/1" "name" "Tom Bombadil"
"person" "http://example.net/person/1" "email" "tom1@example.net"
"address" "http://example.net/address/2" "city" "New York"
"address" "http://example.net/address/2" "postal code" "10001"
"person" "http://example.net/person/3" "name" "Jack Sprat"
"person" "http://example.net/person/3" "email" "jack3@example.net"
"address" "http://example.net/address/4" "city" "London"
"address" "http://example.net/address/4" "postal code" "E14 5AB"
With output like this we can easily sort and group it.
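As a sketch of that sorting and grouping step in Python, the code below sorts decoded lines and groups them by type and identifier using itertools.groupby. The function group_records and the shortened sample data are assumptions for illustration; the lines follow the {type} {identifier} {attribute} {value} scheme with tab separators.

```python
import json
from itertools import groupby

LINES = [
    '"person"\t"http://example.net/person/1"\t"name"\t"Tom Bombadil"',
    '"address"\t"http://example.net/address/2"\t"city"\t"New York"',
    '"person"\t"http://example.net/person/1"\t"email"\t"tom1@example.net"',
    '"address"\t"http://example.net/address/2"\t"postal code"\t"10001"',
]

def group_records(lines):
    """Group sorted lines into {(type, identifier): {attribute: value}}."""
    # Sorting brings rows with the same type and identifier together.
    rows = sorted([json.loads(f) for f in line.split("\t")] for line in lines)
    records = {}
    for key, group in groupby(rows, key=lambda row: (row[0], row[1])):
        records[key] = {attribute: value for _, _, attribute, value in group}
    return records

records = group_records(LINES)
```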
Regression Tests¶
When maintaining extractors it can be helpful to have some sample input and output so that regression testing can be performed when we need to change the extractors.
Wextracto supports this through the --save and --save-dir options to
the wex command. These options save both the input and
output to a local directory.
This input and output can then be used for comparison with the current extractor output.
To compare current output against saved output run py.test like so:
$ py.test saved/