jeudi 28 avril 2016

Create unit-test for method of scrapy CrawlSpider

The initial problem

I am writing a Spider class (using the scrapy library) and rely on a lot of scrapy asynchronous magic to make it work. Here it is, stripped down:

class MySpider(CrawlSpider):
    rules = [Rule(LinkExtractor(allow='myregex'), callback='parse_page')]
    # some other class attributes

    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.response = None
        self.loader = None

    def parse_page_section(self):
        soup = BeautifulSoup(self.response.body, 'lxml')
        # Complicated scraping logic using BeautifulSoup
        self.loader.add_value(mykey, myvalue)

    # more methods parsing other sections of the page
    # also using self.response and self.loader

    def parse_page(self, response):
        self.response = response
        self.loader = ItemLoader(item=Item(), response=response)
        self.parse_page_section()
        # call other methods to collect more stuff
        self.loader.load_item()

The class attribute rule tells my spider to follow certain links and jump to a callback function once the web-pages are downloaded. My goal is to test the parsing method called parse_page_section without running the crawler or even making real HTTP requests.

What I tried

Instinctively, I turned myself to the unittest.mock library. I understand how you mock a function to test whether it has been called (with which arguments and if there were any side effects...), but that's not what I want. I want to instantiate a fake object MySpider and assign just enough attributes to be able to call parse_page_section method on it.

In the above example, I need a response object to instantiate my ItemLoader and specifically a self.response.body attribute to instantiate my BeautifulSoup. In principle, I could make fake objects like this:

from argparse import Namespace

my_spider = MySpider(CrawlSpider)
my_spider.response = NameSpace(body='<html>...</html>')

That works well to for the BeautifulSoup class but I would need to add more attributes to create an ItemLoader object. For more complex situations, it would become ugly and unmanageable.

My questions

How do I test a method on an object in a partially defined state? Is this the right approach altogether? I can't find similar examples on the web, so I think my approach my may be wrong at a more fundamental level. Any insight would be greatly appreciated.

Aucun commentaire:

Enregistrer un commentaire