0
\$\begingroup\$

I am working on a Web Scraper for the first time using Test Driven Development, however I have caught myself into a huge CRAP (Change Risk Anti-Patterns) index (56) and I can not seem to find a solution about this.

<?php

/**
 * Code Snippet MIT licensed taken from proprietary source.
 */

namespace WebScraper;
use WebScraper\Contracts\Scraper;
use WebScraper\Collections\BaseCollection;
use WebScraper\Exceptions\EngineIgnitionException;

class Engine
{
    protected $targetAdAge;

    protected $currentAdAge = 1;

    protected $targetPage  = 999;

    protected $currentPage = 1;

    public function __construct(Scraper $scraper, BaseCollection $collection)
    {
        $this->scraper    = $scraper;
        $this->collection = $collection;
    }

    public function start($targetAdAge = null)
    {
        if ($this->collection->count() >= 1)
        {
            throw new EngineIgnitionException("Engine can not be ran twice in the same instance.");
        }

        if (array_key_exists('adage', $requestData = $scraper->queryBuilder()->buildArray()))
        {
            $this->targetAdAge = $requestData['adage'];
        }

        if (! $this->targetAdAge)
        {
            throw new EngineIgnitionException("targetAdAge value is a core part of the Engine and it is missing.");
        }

        for ($this->currentAdAge = 1; $this->currentAdAge <= $this->targetAdAge; ++$this->currentAdAge)
        {
            for ($this->currentPage = 1; $this->currentPage <= $this->targetPage; ++$this->currentPage)
            {
                $this->scraper->queryBuilder()->adage($this->currentAdAge)->page($this->currentPage);
                $parser = $this->scraper->extract()->parse();
                $items = $parser->items();
                $this->targetPage = $parser->maxPages();
                foreach ($items as &$item)
                {
                    $item->age = (new DateTime('now'))->modify("-{$this->currentAdAge} day");
                    $this->collection->push($item);
                }

                unset($item);
            }
        }

        return $this->collection;
    }
}

The main issue is in the start() function.

Basically, what this class does is building an Engine to web scrape a website that uses pagination to show their data, and because of so I am forced to use iterators and scrape every single page to gather all the data. Line 48 is the function that does the dirty job and scrapes the website.

The thing is, this function is likely to take even minutes to complete if there's a lot of data to extract or if scraping filters are very broad. This is the only function where I really do not know how to refactor it in a better way. The good thing, however, is that every other single class has a low CRAP index (< 5) and Code Coverage is above 85%.

How do I refactor the iterator of page and adAge?

\$\endgroup\$
2
  • \$\begingroup\$ Would you mind explaining just exactly what "CRAP" you have spotted? What's the "change risk anti-pattern" here? The use of an iterator? \$\endgroup\$ Commented Feb 29, 2016 at 13:00
  • \$\begingroup\$ @Pimgd PHPUnit has warned me of the high CRAP index in the start() function, which I believe is caused by the two iterators. \$\endgroup\$ Commented Feb 29, 2016 at 13:02

1 Answer 1

2
+50
\$\begingroup\$

The code looks mostly fine to me. If you feel that the code isn't as clean as it should be, or just want it to pass the metric, what you could do is split it into functions:

            $this->scraper->queryBuilder()->adage($this->currentAdAge)->page($this->currentPage);
            $parser = $this->scraper->extract()->parse();
            $items = $parser->items();
            $this->targetPage = $parser->maxPages();

That goes into a private function which sets targetPage and returns $items... you could call it parsePageToItems

And this

            foreach ($items as &$item)
            {
                $item->age = (new DateTime('now'))->modify("-{$this->currentAdAge} day");
                $this->collection->push($item);
            }

            unset($item);

Could go into a separate function mapItems.

You could then write a test for mapItems in particular, increasing coverage and further reducing your CRAP score.

\$\endgroup\$
2
  • \$\begingroup\$ I am struggling at finding a useful test for the Engine class. I don't think a test is even necessary for this class since every single part of code behind the scenes is fully tested. Maybe the only thing that could make sense to test are the Exceptions? \$\endgroup\$ Commented Mar 1, 2016 at 11:37
  • \$\begingroup\$ The test, of course, is to see if it puts the items in the collection properly. You know, happy path testing. Just show that it works! \$\endgroup\$ Commented Mar 1, 2016 at 13:12

You must log in to answer this question.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.