Wextracto¶
Wextracto is a framework for data extraction from web resources.
Tutorial¶
The tutorial gives step-by-step instructions to get started with Wextracto.
Introduction¶
This tutorial shows you how to extract data from an HTML web page using Wextracto.
To work through the tutorial you need to download and install Python.
You also need to install Wextracto. If you can, you should install it into a virtual environment because this makes things easier to manage. The recommended way to install Wextracto is using pip:
$ pip install Wextracto
This will install the wex command:
$ wex --help
You are now ready to begin the tutorial.
Writing A Minimal Extractor¶
An extractor is a function that takes an HTTP response and returns values
extracted from it. Our extractor is going to return the URL of the response.
Write or copy the following into a file called tutorial.py:
def extract(response):
    return response.geturl()
The response parameter here is a file-like object of the type used by the standard library urllib2.
Now we need to tell the wex command about our new extractor. We do this by creating a file called entry_points.txt with the following contents:
[wex]
extract = tutorial:extract
Now run wex with the following URL:
$ wex http://gilessbrown.github.io/cheeses/cheddar.html
"http://gilessbrown.github.io/cheeses/cheddar.html"
Congratulations, you have just written an extractor!
Selecting Elements¶
Python has a great library for processing XML and HTML data called lxml. We can use this library in our extractor.
Let’s use a simple XPath expression to get some text from our chosen web page.
Edit tutorial.py to look like this:
from lxml.html import parse

def extract(response):
    tree = parse(response)
    return tree.xpath('//h1/text()')
Now re-run wex with the same URL we used previously:
$ wex http://gilessbrown.github.io/cheeses/cheddar.html
["\n\t\t\tCheddar\n\t\t"]
You may be wondering about the square brackets around the text. That is because wex serializes values using JSON. Our XPath expression returns a Python list, which gives us the square brackets in JSON.
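As an illustration (using the standard json module directly, not wex itself), encoding a one-element list produces exactly those brackets and escape sequences:
>>> import json
>>> json.dumps(["\n\t\t\tCheddar\n\t\t"])
'["\\n\\t\\t\\tCheddar\\n\\t\\t"]'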
You may also have noticed the leading and trailing whitespace. We’ll look at how to get rid of that in the next section.
Extracting Text¶
We normally want text we extract from HTML elements to be space-normalized. This means runs of whitespace are converted into a single space character, and leading and trailing whitespace is trimmed.
Wextracto provides the text function to return the space-normalized text for each selected element.
Here is what our extractor now looks like:
from lxml.html import parse
from wex.etree import text

def extract(response):
    tree = parse(response)
    return text(tree.xpath('//h1/text()'))
Let’s run wex with the usual URL again to check the result:
$ wex http://gilessbrown.github.io/cheeses/cheddar.html
"Cheddar"
That is much tidier.
You may be wondering why we don’t just use the XPath normalize-space function. There are actually several reasons why we do not want to do this, most of which are specific to extracting text from HTML as opposed to XML:
The text function:
- understands <br> tags
- uses a unicode definition of whitespace (e.g. non-breaking spaces)
- can work with multiple nodes in a node-set
Multiple Values¶
Often we want to extract multiple values from our web page. This is done by yield-ing values instead of return-ing a single value.
So that we know which value is which, we also label the values by yielding a name for each value at the same time.
Modify tutorial.py to yield the names and values:
from lxml.html import parse
from wex.etree import text

def extract(response):
    tree = parse(response)
    yield "name", text(tree.xpath('//h1'))
    yield "country", text(tree.xpath('//dd[@id="country"]'))
    yield "region", text(tree.xpath('//dd[@id="region"]'))
Now re-run wex:
$ wex http://gilessbrown.github.io/cheeses/cheddar.html
"name" "Cheddar"
"country" "England"
"region" "Somerset"
Wextracto uses the tab character to separate the label from the value.
Errors¶
Yielding multiple values from an extractor works fine if all the values extract successfully. Unfortunately, if one of them raises an exception, we don’t get the remaining values even if they would have extracted successfully.
Let’s extend the extractor we wrote in the previous section and add a new attribute. This time let’s deliberately make a mistake so we can see what happens:
from wex.etree import parse, text

def extract(response):
    tree = parse(response)
    yield "name", text(tree.xpath('//h1/text()'))
    yield "whoops", 1/0
    yield "country", text(tree.xpath('//dd[@id="country"]'))
    yield "region", text(tree.xpath('//dd[@id="region"]'))
Now re-run wex:
$ wex http://gilessbrown.github.io/cheeses/cheddar.html
"name" "Cheddar"
#ZeroDivisionError('integer division or modulo by zero',)!
The # and ! at the start and end of that final line are Wextracto’s way of telling us that we ended up with a value that was not JSON encodable, in this case because a ZeroDivisionError exception was raised.
Notice how we didn’t see more values following the exception.
What we’d really like is for each attribute to be extracted in such a way that an exception while extracting one attribute doesn’t mean the others don’t get extracted.
To make that happen we’ll need each attribute to be extracted in its own function. In the next section we’ll see how Wextracto helps you do that.
Named¶
Wextracto provides a function specifically for extracting named values, called wex.extractor.named(). This lets you create a collection of extractors, each of which has a name. The collection is itself callable, and when called it yields the results of each extractor together with its name.
Extractors can be added to the collection by decorating them with the collection’s add() method.
Copy the code from here:
from wex.extractor import named
from wex.etree import xpath, text

extract = named()

@extract.add
def name(response):
    return text(xpath('//h1')(response))

@extract.add
def whoops(response):
    return 1/0

@extract.add
def country(response):
    return text(xpath('//dd[@id="country"]')(response))

@extract.add
def region(response):
    return text(xpath('//dd[@id="region"]')(response))
You may notice that we have switched from calling .xpath() on the element tree to using the wex.etree.xpath function. The function produced by calling xpath knows when to parse the response, so we don’t need to organize that ourselves.
Let’s try running our extractor now and see what we get:
$ wex http://gilessbrown.github.io/cheeses/cheddar.html
"country" "England"
"whoops" #ZeroDivisionError('integer division or modulo by zero',)!
"region" "Somerset"
"name" "Cheddar"
Now we’ve got something for all the named values we wanted, and in addition it tells us which extractor isn’t working.
Composing Extractors¶
If you need to write a lot of extractors then you may find that using the decorator syntax for wex.extractor.Named leads to a lot of boilerplate code. Fortunately there is an alternative.
If you look at the examples in the previous section, you will see that the extractors (apart from whoops) all look something like:
def xyz(response):
    return text(xpath(...)(response))
It turns out this kind of pattern is very common in writing extractors. A technique called function composition lets us define these extractor functions very succinctly.
In Wextracto, function composition is performed with the | operator (like Unix pipes).
So we can define the extractor above as:
xyz = xpath(...) | text
We can pass these composed functions directly into the constructor for wex.extractor.Named and get something that looks like:
from wex.extractor import named
from wex.etree import xpath, text

extract = named(name = xpath('//h1') | text,
                country = xpath('//dd[@id="country"]') | text,
                region = xpath('//dd[@id="region"]') | text)
As you can see, this is a very compact representation for simple extractors.
Labelling¶
So far we’ve only been extracting data from one web page, but eventually we’d like to move on to extracting from multiple pages. Let’s see what happens:
$ wex http://gilessbrown.github.io/cheeses/cheddar.html http://gilessbrown.github.io/cheeses/brie.html
"country" "tEngland"
"region" "Somerset"
"name" "Cheddar"
"country" "France"
"region" "Seine-et-Marne"
"name" "Brie"
Oh dear. It isn’t very clear which value came from which web page.
We can fix this by using the wex.extractor.label() function:
from wex.extractor import label, named
from wex.url import url
from wex.etree import xpath, text

attrs = named(name = xpath('//h1') | text,
              country = xpath('//dd[@id="country"]') | text,
              region = xpath('//dd[@id="region"]') | text)

extract = label(url)(attrs)
The code here is going to label the output with the URL of the current response.
Let’s try it:
$ wex http://gilessbrown.github.io/cheeses/cheddar.html http://gilessbrown.github.io/cheeses/brie.html
"http://gilessbrown.github.io/cheeses/cheddar.html" "country" "England"
"http://gilessbrown.github.io/cheeses/cheddar.html" "region" "Somerset"
"http://gilessbrown.github.io/cheeses/cheddar.html" "name" "Cheddar"
"http://gilessbrown.github.io/cheeses/brie.html" "country" "France"
"http://gilessbrown.github.io/cheeses/brie.html" "region" "Seine-et-Marne"
"http://gilessbrown.github.io/cheeses/brie.html" "name" "Brie"
As before, the labels are tab delimited.
Multiple Entities¶
In the Labelling section we saw how we can label values with the URL from which they came, but sometimes you get multiple entities on the same web page and they each have their own set of attributes.
Let’s try our extractor on that kind of page:
$ wex http://gilessbrown.github.io/cheeses/gloucester.html
"http://gilessbrown.github.io/cheeses/gloucester.html" "country" #MultipleValuesError()!
"http://gilessbrown.github.io/cheeses/gloucester.html" "region" #MultipleValuesError()!
"http://gilessbrown.github.io/cheeses/gloucester.html" "name" #MultipleValuesError()!
Oh dear. What can we do? Well, if we visit that web page in a browser and view the source, we find that each <h1> helpfully has an International Cheese Identification Number (ICIN) as an attribute.
So what we can do is re-write the extractor to visit each <h1> and extract the data we want relative to that element.
Here is what the code looks like:
from wex.extractor import named, labelled
from wex.iterable import one
from wex.etree import xpath, text

cheeses = xpath('//h1[@data-icin]')
icin_attr = xpath('@data-icin') | one
attrs = named(name = text,
              country = xpath('following::dd[@id="country"][1]') | text,
              region = xpath('following::dd[@id="region"][1]') | text)
extract_cheese = labelled(icin_attr, attrs)

def extract(response):
    for cheese in cheeses(response):
        for item in extract_cheese(cheese):
            yield item
And then we run wex:
$ wex http://gilessbrown.github.io/cheeses/gloucester.html
"SNGGLCD7DDFD41" "country" "England"
"SNGGLCD7DDFD41" "region" "Gloucestershire"
"SNGGLCD7DDFD41" "name" "Single Gloucester"
"DBLGLCCECAA22C" "country" "England"
"DBLGLCCECAA22C" "region" "Gloucestershire"
"DBLGLCCECAA22C" "name" "Double Gloucester"
What Next?¶
- Read the User Guide.
- Read the source code.
User Guide¶
The user guide gives broad general guidance about how to use Wextracto.
What is Wextracto?¶
Wextracto is a Python package designed to be the core of a web crawling/scraping system.
To understand how it fits in, let us look at the general architecture of a web crawling/scraping system based on Wextracto.
This architecture has these components:
- Schedule
- This component manages the URLs to be downloaded. The goal is to keep track of which URLs you have downloaded and which URLs you have yet to download.
- Download
- This component requests web pages and stores the responses for use by the Wextracto component.
- Wextracto
- This component reads the stored responses and extracts URLs and data. URLs are routed to the Schedule component.
- Collate
- This component receives data from the Wextracto component and organizes it ready for use. Organizing the data might involve storing it in a database.
Each of the other three components (Schedule, Download and Collate) can be implemented in multiple ways depending on the requirements of the crawl system. Keeping just the core in Wextracto gives better separation of concerns.
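As a sketch of how these components might fit together, here is a minimal crawl loop. The crawl function and its download, extract and collate parameters are hypothetical stand-ins, not part of Wextracto; only the "url" label convention follows the Output section later in this documentation:
from collections import deque

def crawl(seed_urls, download, extract, collate):
    # Schedule: track which URLs have been seen and which are still queued.
    queue, seen = deque(seed_urls), set(seed_urls)
    while queue:
        url = queue.popleft()
        response = download(url)                # Download component
        for label, value in extract(response):  # Wextracto component
            if label == "url":
                if value not in seen:           # route URLs back to the Schedule
                    seen.add(value)
                    queue.append(value)
            else:
                collate(url, label, value)      # Collate component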
Interfaces¶
In the architecture described above, Wextracto has three data flows: one incoming (responses) and two outgoing (URLs and data).
Responses¶
Although Wextracto can download and extract in one go, it is designed to be used in a system where the downloading is done separately from the extraction.
Having the download separate from extraction is generally helpful because:
- it allows us to repeat the extraction process exactly for problem finding
- it gives us easy access to large sample data sets
- it can make the management of I/O in the system clearer
Wextracto can process responses that look like HTTP responses (headers then content). For example:
$ curl -D- http://httpbin.org/ip
HTTP/1.1 200 OK
Connection: keep-alive
Server: gunicorn/18.0
Date: Tue, 30 Dec 2014 19:32:18 GMT
Content-Type: application/json
Content-Length: 32
Access-Control-Allow-Origin: *
Access-Control-Allow-Credentials: true
Via: 1.1 vegur
{
"origin": "67.180.76.235"
}
Most extractors will require the presence of a custom HTTP header, X-wex-request-url, that contains the requested URL. Any component preparing responses for processing with Wextracto should add this header.
A request can lead to multiple responses, each with its own URL. In the case where the response URL is not the same as the request URL, an additional header, X-wex-url, containing the response’s URL should be added.
Wextracto looks for responses such as these in files that have a .wexin extension. It can also read a .tar file containing files with the same extension.
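As an illustration of this layout (a sketch only; the header values are taken from the curl example above, with the custom request-url header added), a minimal .wexin file might look like this:
HTTP/1.1 200 OK
X-wex-request-url: http://httpbin.org/ip
Content-Type: application/json
Content-Length: 32

{
  "origin": "67.180.76.235"
}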
Reference¶
The reference gives detailed information on the modules and classes provided by the Wextracto package.
This is the reference for the Wextracto web data extraction package.
Command¶
The wex command extracts data from HTTP-like responses. These responses can come from files, directories or URLs specified on the command line. The command calls any extractors that have been registered and writes any data extracted as output.
The output and input can be saved using the --save or --save-dir command line arguments. This is useful for regression testing existing extractor functions. The tests are run using py.test.
For the complete list of command line arguments run:
$ wex --help
Registering Extractors¶
The simplest way to register extractors is to have a file named entry_points.txt in the current directory. This file should look something like this:
[wex]
.example.net = mymodule:extract_from_example_net
The [wex] section heading tells Wextracto that the following lines register extractors. Extractors are registered using name = value pairs.
If the name starts with . then the extractor is only applied to responses from domain names that match that name. Our example would match responses from www.example.net or example.net.
If the name does not start with . it will be applied to responses whatever their domain.
You can register the same extractor against multiple domain names by having multiple lines with the same value but different names.
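For example (the module and domain names here are illustrative):
[wex]
.example.net = mymodule:extract
.example.org = mymodule:extract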
This is exactly the same format and content that you would use in the entry_points parameter for a setup function, if and when you want to package and distribute your extractor functions.
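For instance, a minimal setup.py along these lines would register the same extractor when the package is installed (the package name is an assumption for illustration):
from setuptools import setup

setup(
    name='mymodule',              # hypothetical package name
    py_modules=['mymodule'],
    entry_points="""
        [wex]
        .example.net = mymodule:extract_from_example_net
    """,
)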
Extractor¶
An extractor is a callable that returns or yields data. For example:
def extract(response):
    return "something"
The response parameter here is an instance of wex.response.Response.
Extractors can be combined in various ways.
- class wex.extractor.Named(**kw)[source]¶
An extractor that is a collection of named extractors.
Extractors can be added to the collection on construction using keyword arguments for the names, or they can be added using add().
The names are labels in the output produced. For example, an extractor function extract defined as follows:
extract = Named(
    name1 = (lambda response: "one"),
    name2 = (lambda response: "two"),
)
would produce extraction output something like this:
$ wex http://example.net/
"name1" "one"
"name2" "two"
The ordering of sub-extractor output is arbitrary.
- add(extractor, label=None)[source]¶
Add an attribute extractor.
Parameters:
- extractor (callable) – The extractor to be added.
- label (str) – The label for the extractor. This may be None, in which case the extractor’s __name__ attribute will be used.
This method returns the extractor added. This means it can also be used as a decorator. For example:
attrs = Named()

@attrs.add
def attr1(response):
    return "one"
- wex.extractor.chained(*extractors)[source]¶
Returns an extractor that chains the output of other extractors. The output is the output from each extractor in sequence.
Parameters: extractors – an iterable of extractor callables to chain.
For example, an extractor function extract defined as follows:
def extract1(response):
    yield "one"

def extract2(response):
    yield "two"

extract = chained(extract1, extract2)
would produce the following extraction output:
$ wex http://example.net/
"one"
"two"
- wex.extractor.labelled(*args)[source]¶
Returns an extractor decorator that will label the output of an extractor.
Parameters: literals_or_callables – An iterable of labels or callables.
Each item in literals_or_callables may be a literal or a callable. Any callable will be called with the same parameters as the extractor, and whatever is returned will be used as a label.
For example, an extractor function extract defined as follows:
def extract1(response):
    yield "one"

def label2(response):
    return "label2"

extract = label("label1", label2)(extract1)
would produce the following extraction output:
$ wex http://example.net/
"label1" "label2" "one"
Note that if any of the labels are false then no output will be generated from that extractor.
Element Tree¶
Composable functions for extracting data using lxml.
- wex.etree.base_url_pair_getter(get_url)[source]¶
Returns a function for getting a tuple of (base_url, url) when called with an etree Element or ElementTree.
In the returned pair, base_url is the value returned from get_base_url on the etree Element or ElementTree. The second value is the value returned by calling get_url on the same etree Element or ElementTree, joined to the base_url using urljoin. This allows get_url to return a relative URL.
- wex.etree.css(expression)[source]¶
Returns a composable callable that will select elements defined by a CSS selector expression.
Parameters: expression – The CSS selector expression.
The callable returned accepts a wex.response.Response, a list of elements or an individual element as an argument.
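For example (mirroring the xpath example below; the selector can then be called on a response or element):
>>> from lxml.html import fromstring
>>> from wex.etree import css
>>> tree = fromstring('<h1>Hello</h1>')
>>> selector = css('h1')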
- wex.etree.drop_tree(*selectors)[source]¶
Return a function that will remove trees selected by selectors.
- wex.etree.href_any_url¶
A wex.composed.ComposedFunction that returns the absolute URL from an href attribute.
- wex.etree.href_url¶
A wex.composed.ComposedFunction that returns the absolute URL from an href attribute, as long as it is from the same domain as the base URL of the response.
- wex.etree.href_url_same_suffix¶
A wex.composed.ComposedFunction that returns the absolute URL from an href attribute, as long as it is from the same public suffix as the base URL of the response.
- wex.etree.same_suffix(url_pair)[source]¶
Return the second URL of the pair if both have the same public suffix.
- wex.etree.src_url¶
A wex.composed.ComposedFunction that returns the absolute URL from a src attribute.
- wex.etree.text¶
Alias for normalize-space | list2set.
- wex.etree.text_content¶
Return text content from an object (typically a node-set), excluding content from within <script> or <style> elements.
- wex.etree.xpath(expression, namespaces={u're': u'http://exslt.org/regular-expressions'})[source]¶
Returns a composable callable that will select elements defined by an XPath expression.
Parameters:
- expression – The XPath expression.
- namespaces – The namespaces.
The callable returned accepts a wex.response.Response, a list of elements or an individual element as an argument.
For example:
>>> from lxml.html import fromstring
>>> tree = fromstring('<h1>Hello</h1>')
>>> selector = xpath('//h1')
Regular Expressions¶
- wex.regex.re_group(pattern, group=1, flags=0)[source]¶
Returns a composable callable that extracts the specified group using a regular expression.
Parameters:
- pattern – The regular expression.
- group – The group from the MatchObject.
- flags – Flags to use when compiling the pattern.
- wex.regex.re_groupdict(pattern, flags=0)[source]¶
Returns a composable callable that extracts a group dictionary using a regular expression.
Parameters:
- pattern – The regular expression.
- flags – Flags to use when compiling the pattern.
String Functions¶
Iterables¶
Helper functions for things that are iterable
- exception wex.iterable.MultipleValuesError[source]¶
More than one value was found when one or none were expected.
- exception wex.iterable.ZeroValuesError[source]¶
Zero values were found when at least one was expected.
- wex.iterable.islice(*islice_args)[source]¶
Returns a function that will perform itertools.islice on its input.
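Assuming the arguments are passed straight through to itertools.islice (so a single argument is a stop value), usage might look like:
>>> from wex.iterable import islice
>>> first_two = islice(2)
>>> list(first_two('abcdef'))
['a', 'b']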
- wex.iterable.one(iterable)[source]¶
Returns the item from an iterable of exactly one element.
If the iterable comprises zero elements then ZeroValuesError is raised. If the iterable has more than one element then MultipleValuesError is raised.
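A brief illustration of the behaviour described above:
>>> from wex.iterable import one
>>> one(['a'])
'a'
An iterable with zero or multiple elements raises ZeroValuesError or MultipleValuesError respectively, as described above.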
- wex.iterable.one_or_none(iterable)[source]¶
Returns one item or None from an iterable of length one or zero.
If the iterable is empty then None is returned. If the iterable has more than one element then MultipleValuesError is raised.
URLs¶
Other Methods¶
- wex.form.form_values(self)[source]¶
Return a list of tuples of the field values for the form. This is suitable to be passed to urllib.urlencode().
- class wex.ftp.RETRReadable(ftp, basename)[source]¶
Just like ftplib.FTP.retrbinary, but implements read and readline.
- wex.ftp.close_on_empty(unbound)[source]¶
Calls ‘close’ on the first argument when the method returns something falsey. The first argument is presumed to be self.
- wex.ftp.format_header()¶
S.format(*args, **kwargs) -> unicode
Return a formatted version of S, using substitutions from args and kwargs. The substitutions are identified by braces (‘{’ and ‘}’).
- wex.ftp.format_status_line()¶
S.format(*args, **kwargs) -> unicode
Return a formatted version of S, using substitutions from args and kwargs. The substitutions are identified by braces (‘{’ and ‘}’).
Functions for getting responses for HTTP URLs.
- wex.http.format_header()¶
S.format(*args, **kwargs) -> unicode
Return a formatted version of S, using substitutions from args and kwargs. The substitutions are identified by braces (‘{’ and ‘}’).
- wex.http.format_status_line()¶
S.format(*args, **kwargs) -> unicode
Return a formatted version of S, using substitutions from args and kwargs. The substitutions are identified by braces (‘{’ and ‘}’).
Sitemaps¶
Extractors for URLs from /robots.txt and sitemaps.
- wex.sitemaps.urls_from_sitemaps = Chained([urls_from_robots_txt, urls_from_urlset_or_sitemapindex])¶
Extractor that combines urls_from_robots_txt() and urls_from_urlset_or_sitemapindex().
- wex.sitemaps.urls_from_urlset_or_sitemapindex(response)[source]¶
Yields URLs from <urlset> or <sitemapindex> elements, as per sitemaps.org.
Response¶
- class wex.response.Response(content, headers, url, code=None, **kw)[source]¶
A urllib2-style Response with some extras.
Parameters:
- content – A file-like object containing the response content.
- headers – An HTTPMessage containing the response headers.
- url – The URL for which this is the response.
- code – The status code received with this response.
- protocol – The protocol received with this response.
- version – The protocol version received with this response.
- reason – The reason received with this response.
- request_url – The URL requested that led to this response.
Composed¶
Wextracto uses function composition as an easy way to build new functions from existing ones:
>>> from wex.composed import compose
>>> def add1(x):
... return x + 1
...
>>> def mult2(x):
... return x * 2
...
>>> f = compose(add1, mult2)
>>> f(2)
6
Wextracto uses the pipe operator, |, as a shorthand for function composition. This shorthand can be a powerful technique for reducing boilerplate code when used in combination with named() extractors:
from wex.etree import css, text
from wex.extractor import named

attrs = named(title = css('h1') | text,
              description = css('#description') | text)
- class wex.composed.ComposedCallable(*functions)[source]¶
A callable, taking one argument, composed from other callables.
def mult2(x):
    return x * 2

def add1(x):
    return x + 1

composed = ComposedCallable(add1, mult2)

for x in (1, 2, 3):
    assert composed(x) == mult2(add1(x))
ComposedCallable objects are composable. They can be composed of other ComposedCallable objects.
- wex.composed.composable(func)[source]¶
Decorates a callable to support function composition using |. For example:
@Composable.decorate
def add1(x):
    return x + 1

def mult2(x):
    return x * 2

composed = add1 | mult2
- wex.composed.compose(*functions)[source]¶
Create a ComposedCallable from zero or more functions.
Output¶
Extracted data values are represented with tab-separated fields. The right-most field on each line is the value, all preceding fields are labels that describe the value. The labels and the value are all JSON encoded.
So for example, a value 9.99 with the labels product and price would look like:
"product" "price" 9.99
And we could decode this line with the following Python snippet:
>>> import json
>>> line = '"product"\t"price"\t9.99\n'
>>> [json.loads(s) for s in line.split('\t')]
[u'product', u'price', 9.99]
Using tab-delimiters is convenient for downstream processing using Unix command line tools such as cut and grep.
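For example, using the tutorial output from earlier (a sketch; grep selects the lines labelled country, and cut takes the second tab-separated field):
$ wex http://gilessbrown.github.io/cheeses/cheddar.html | grep '"country"' | cut -f2
"England"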
URL Labelling¶
The convention for Wextracto is that any URL that should be downloaded has the left-most label url. For example:
"url" "http://example.net/some/url"
Data Labelling¶
If you are extracting multiple types of data (for example people and addresses) then a good labelling scheme is important.
It is a good idea to label the extracted values so that you can sort them easily using the Unix sort command.
An example of a labelling scheme that allows this would be:
{type} {identifier} {attribute} {value}
So we might end up with output that looks like this:
"person" "http://example.net/person/1" "name" "Tom Bombadil"
"person" "http://example.net/person/1" "email" "tom1@example.net"
"address" "http://example.net/address/2" "city" "New York"
"address" "http://example.net/address/2" "postal code" "10001"
"person" "http://example.net/person/3" "name" "Jack Sprat"
"person" "http://example.net/person/3" "email" "jack3@example.net"
"address" "http://example.net/address/4" "city" "London"
"address" "http://example.net/address/4" "postal code" "E14 5AB"
With output like this we can easily sort and group it.
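For example, assuming the output above had been saved to a (hypothetical) file extracted.txt, a plain sort brings all the attributes of each entity together:
$ sort extracted.txt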
Regression Tests¶
When maintaining extractors it can be helpful to have some sample input and output so that regression testing can be performed when we need to change the extractors.
Wextracto supports this via the --save or --save-dir options to the wex command. These options save both the input and output to a local directory.
This input and output can then be used for comparison with the current extractor output.
To compare the current output against the saved output, run py.test like so:
$ py.test saved/