src.crawler module

Crawler is responsible for instantiating Producers and Consumers, and – importantly – managing their execution and termination.

class src.crawler.Crawler(base_url: str, consumer_transformer: ~src.transformer.Transformer, producer_transformer: ~src.transformer.Transformer, consumer_count: int = 10, producer_count: int = 10, max_links: int = 10, sink_max: int = 10, source_max: int = 10, terminate: ~typing.Callable = <function Crawler.<lambda>>)

Bases: object

class Destructor(crawler: Crawler, event_bus: EventBus, terminate: Callable, drain_timeout: int = 1)

Bases: object

async drain(queue_name: str) None

Removes all items from the underlying sink/source queues so that the Queue.join call in the parent Crawler completes and the run terminates.

Parameters:

queue_name (str) – Name of the queue to drain. Should be either “source” or “sink”.
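The draining idea can be sketched with a plain asyncio.Queue. This is a minimal illustration, not the Destructor's actual code: drain_queue and demo_drain are hypothetical names, and the sketch assumes the underlying queues behave like asyncio.Queue, where join() returns only after every item has been matched by a task_done() call.

```python
import asyncio


async def drain_queue(queue: asyncio.Queue) -> int:
    """Discard every pending item and mark it done so join() can return.

    Hypothetical helper; the real drain takes a queue name instead.
    """
    removed = 0
    while True:
        try:
            queue.get_nowait()
        except asyncio.QueueEmpty:
            break
        # Each removed item must be acknowledged, otherwise join() blocks.
        queue.task_done()
        removed += 1
    return removed


async def demo_drain() -> int:
    queue: asyncio.Queue = asyncio.Queue(maxsize=10)
    for url in ("a", "b", "c"):
        queue.put_nowait(url)
    removed = await drain_queue(queue)
    # With every item acknowledged, join() completes without waiting.
    await asyncio.wait_for(queue.join(), timeout=1)
    return removed


if __name__ == "__main__":
    print(asyncio.run(demo_drain()))
```

Calling task_done() for each discarded item is what unblocks join(); emptying the queue alone would not be enough.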

async shutdown(_: Any) None

Initiate a shutdown on the parent Crawler by terminating all running workers and draining the underlying queues.

Parameters:

_ (Any) – Placeholder that accepts the triggering Event; it is unused in this implementation.

terminate_workers(producers: bool = True, consumers: bool = True) None

Cancel all running Producers and/or Consumers.

Parameters:
  • producers (bool) – Whether or not to cancel Producers.

  • consumers (bool) – Whether or not to cancel Consumers.
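As a rough sketch of what cancelling workers involves, assuming Producers and Consumers run as asyncio Tasks (the names idle_worker, cancel_all and demo_cancel are hypothetical and not part of this module):

```python
import asyncio


async def idle_worker() -> None:
    # Stand-in for a Producer or Consumer loop that never finishes by itself.
    await asyncio.sleep(3600)


def cancel_all(tasks: list) -> None:
    """Request cancellation of every task that is still running."""
    for task in tasks:
        if not task.done():
            task.cancel()


async def demo_cancel() -> list:
    producer_tasks = [asyncio.create_task(idle_worker()) for _ in range(2)]
    consumer_tasks = [asyncio.create_task(idle_worker()) for _ in range(3)]
    await asyncio.sleep(0)  # give the workers a chance to start

    # Cancel only the consumers first, mirroring the two boolean flags.
    cancel_all(consumer_tasks)
    results = await asyncio.gather(*consumer_tasks, return_exceptions=True)

    # Clean up the producers as well so no task is left pending.
    cancel_all(producer_tasks)
    await asyncio.gather(*producer_tasks, return_exceptions=True)
    return [isinstance(r, asyncio.CancelledError) for r in results]


if __name__ == "__main__":
    print(asyncio.run(demo_cancel()))
```

Task.cancel() only requests cancellation; awaiting the tasks afterwards (here through gather with return_exceptions=True) is what lets them actually finish.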

async watch(event: Any) None

A method invoked on every CrawlerEvents.Registry.UPDATED event, and used to determine if the crawler is done. If so, it emits CrawlerEvents.Crawler.TERMINATE to trigger a shutdown on the parent Crawler.

Parameters:

event (Any) – The event triggering invocation of watch.

consumer(id: int, transformer: Transformer)

Create a Consumer instance using the provided transformer. All Consumers generated by a given Crawler will use the same transformer.

Parameters:

transformer (Transformer) – A Transformer instance used to configure the generated Consumer.

Returns:

An instance of Consumer configured with the provided transformer.

Return type:

Consumer

property consumer_tasks

Returns a list of Consumer tasks.

async crawl() None

This method initiates the crawling process. It:
  • Generates a series of Producers and Consumers

  • Seeds the input queue

  • Initiates the Producer/Consumer Tasks

  • Waits for Producers/Consumers to finish, or for Crawler to initiate shutdown

  • Waits for Queue.join to complete

  • Cancels hanging Consumers
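The lifecycle described for crawl can be approximated with two asyncio queues. The sketch below is a simplified stand-in, not the Crawler implementation: producer_loop, consumer_loop and crawl_sketch are hypothetical names, fetching is replaced by string formatting, and consumers do not feed new links back into the source queue as a real crawl would.

```python
import asyncio


async def producer_loop(source: asyncio.Queue, sink: asyncio.Queue) -> None:
    while True:
        url = await source.get()
        # Stand-in for fetching the page at `url`.
        await sink.put(f"page:{url}")
        source.task_done()


async def consumer_loop(sink: asyncio.Queue, results: list) -> None:
    while True:
        page = await sink.get()
        results.append(page)
        sink.task_done()


async def crawl_sketch(seed: str) -> list:
    source: asyncio.Queue = asyncio.Queue(maxsize=10)
    sink: asyncio.Queue = asyncio.Queue(maxsize=10)
    results: list = []

    # Seed the input queue, then start the workers.
    source.put_nowait(seed)
    workers = [asyncio.create_task(producer_loop(source, sink)) for _ in range(2)]
    workers += [asyncio.create_task(consumer_loop(sink, results)) for _ in range(2)]

    # Wait until every queued item has been processed.
    await source.join()
    await sink.join()

    # The workers loop forever, so cancel whatever is still waiting.
    for task in workers:
        task.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return results


if __name__ == "__main__":
    print(asyncio.run(crawl_sketch("https://example.com")))
```

Because the producer puts its result on the sink before acknowledging the source item, sink.join() cannot return before the consumer has handled that result.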

classmethod create(base_url: str, consumer_transformer: Transformer, producer_transformer: Transformer, terminate: Callable, init: bool = True, producer_count: int = 10, consumer_count: int = 10, sink_max: int = 10, source_max: int = 10)

Factory method

The create method is used to create a new Crawler and generate its Producers/Consumers in one shot.

Parameters:
  • base_url (str) – The initial ‘seed’ URL used to kick off crawling.

  • consumer_transformer (Transformer) – The Transformer instance used to create Consumers.

  • producer_transformer (Transformer) – The Transformer instance used to create Producers.

  • terminate (Callable) – The function used to determine when the Crawler should initiate shutdown.

  • init (bool) – Boolean indicating whether to generate Producers/Consumers upon creation.

  • producer_count (int) – Number of Producers to configure.

  • consumer_count (int) – Number of Consumers to configure.

  • sink_max (int) – Maximum number of items that can be placed in the underlying sink queue.

  • source_max (int) – Maximum number of items that can be placed in the underlying source queue.

Returns:

Crawler instance. If init == True, the returned instance will have its producers and consumers properties populated immediately.

Return type:

Crawler

property done: bool

Boolean representing whether the underlying registry is full.

Returns:

True if self.registry is full, False otherwise.

Return type:

bool

async generate_consumers()

Generate Consumers. These are ‘placeholders’ used by crawl to initiate the actual Consumption process.

async generate_producers()

Generate Producers. These are ‘placeholders’ used by crawl to initiate the actual Production process.

producer(id: int, transformer: Transformer) Producer

Create a Producer instance using the provided transformer. All Producers generated by a given Crawler will use the same transformer.

Parameters:

transformer (Transformer) – A Transformer instance used to configure the generated Producer.

Returns:

An instance of Producer configured with the provided transformer.

Return type:

Producer

property producer_tasks

Returns a list of Producer tasks.

seed_source(input_: str) None

Place an initial ‘seed’ element onto the source queue to initiate crawling.

Parameters:

input_ (str) – Initial element to place on the queue.

shut_down() bool

Method to initiate a shutdown. This flags self.__shutting_down as True, and begins draining the self.source and self.sink queues.

Returns:

Returns the value of self.__shutting_down, which will always be True after invoking this method.

Return type:

bool

property shutting_down: bool

Boolean representing whether Crawler is in the process of draining its queues and shutting down.

Returns:

True if Crawler is draining and shutting down, False otherwise.

Return type:

bool

async update_registry(element: Any) Any

Method to add an element to the underlying registry and emit an event indicating that the registry has been updated. The event is ‘heard’ by the Destructor, which initiates shutdown when the registry is full.

Parameters:

element (Any) – Element added to registry.

Returns:

Returns the element added to the underlying Registry.

Return type:

Any
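The interplay between update_registry, done and the Destructor's watch can be illustrated with a toy model. Everything here is an assumption made for illustration: RegistrySketch is a hypothetical class, the registry is modelled as a set capped at max_links, and a plain list of async callbacks stands in for the EventBus.

```python
import asyncio
from typing import Any


class RegistrySketch:
    """Toy model of a bounded registry that notifies listeners on update."""

    def __init__(self, max_links: int) -> None:
        self.max_links = max_links
        self.items: set = set()
        self.listeners: list = []  # async callbacks, stand-in for an event bus
        self.terminated = False

    @property
    def done(self) -> bool:
        # "Full" is assumed to mean max_links distinct elements were seen.
        return len(self.items) >= self.max_links

    async def update_registry(self, element: Any) -> Any:
        self.items.add(element)
        # Stand-in for emitting a "registry updated" event.
        for listener in self.listeners:
            await listener(element)
        return element

    async def watch(self, _event: Any) -> None:
        # Stand-in for emitting a terminate event once the registry is full.
        if self.done:
            self.terminated = True


async def demo_registry() -> tuple:
    registry = RegistrySketch(max_links=2)
    registry.listeners.append(registry.watch)
    await registry.update_registry("a")
    after_first = registry.terminated
    await registry.update_registry("b")
    return after_first, registry.terminated


if __name__ == "__main__":
    print(asyncio.run(demo_registry()))
```

In this model the shutdown decision lives entirely in the watcher, so update_registry only records the element and notifies listeners.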