src.crawler module
Crawler is responsible for instantiating Producers and Consumers, and – importantly – managing their execution and termination.
- class src.crawler.Crawler(base_url: str, consumer_transformer: Transformer, producer_transformer: Transformer, consumer_count: int = 10, producer_count: int = 10, max_links: int = 10, sink_max: int = 10, source_max: int = 10, terminate: Callable = <function Crawler.<lambda>>)
Bases:
object
- class Destructor(crawler: Crawler, event_bus: EventBus, terminate: Callable, drain_timeout: int = 1)
Bases:
object
- async drain(queue_name: str) -> None
Removes all items from the underlying sink/source queue so that the Queue.join call in the parent Crawler completes and the run terminates.
- Parameters:
queue_name (str) – Name of the queue to drain. Should be either “source” or “sink”.
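The effect of draining on Queue.join can be illustrated with a minimal sketch. It assumes the underlying queues are asyncio.Queue instances and passes the queue object directly rather than a name. The drain and demo functions below are hypothetical stand-ins, not the Destructor's actual code.

```python
import asyncio


async def drain(queue: asyncio.Queue) -> int:
    """Discard every pending item so that a waiting queue.join() can return."""
    drained = 0
    while True:
        try:
            queue.get_nowait()
        except asyncio.QueueEmpty:
            break
        # Each removed item must be marked done, otherwise join() keeps waiting.
        queue.task_done()
        drained += 1
    return drained


async def demo() -> int:
    queue: asyncio.Queue = asyncio.Queue(maxsize=10)
    for url in ("a", "b", "c"):
        queue.put_nowait(url)
    joiner = asyncio.create_task(queue.join())
    drained = await drain(queue)
    # join() returns once the unfinished-task count reaches zero.
    await asyncio.wait_for(joiner, timeout=1)
    return drained


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Calling task_done for each discarded item is the important detail: removing items with get_nowait alone does not lower the counter that join waits on.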
- async shutdown(_: Any) -> None
Initiate a shutdown on the parent Crawler by terminating all running workers and draining the underlying queues.
- Parameters:
_ (Any) – A generic parameter that accepts the Event; this implementation does not use it.
- terminate_workers(producers: bool = True, consumers: bool = True) -> None
Cancel all running Producers and/or Consumers.
- Parameters:
producers (bool) – Whether or not to cancel Producers.
consumers (bool) – Whether or not to cancel Consumers.
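Selective cancellation of this kind might look like the sketch below, assuming the workers run as asyncio tasks. The worker coroutine and the explicit task-list arguments are illustrative assumptions; the real method presumably reads the task lists from the parent Crawler.

```python
import asyncio
from typing import List, Tuple


async def worker() -> None:
    # Stand-in for a Producer or Consumer loop that never finishes on its own.
    while True:
        await asyncio.sleep(3600)


def terminate_workers(
    producer_tasks: List[asyncio.Task],
    consumer_tasks: List[asyncio.Task],
    producers: bool = True,
    consumers: bool = True,
) -> None:
    """Request cancellation of the selected groups of tasks."""
    if producers:
        for task in producer_tasks:
            task.cancel()
    if consumers:
        for task in consumer_tasks:
            task.cancel()


async def demo() -> Tuple[List[bool], List[bool]]:
    producer_tasks = [asyncio.create_task(worker()) for _ in range(2)]
    consumer_tasks = [asyncio.create_task(worker()) for _ in range(2)]
    await asyncio.sleep(0)  # give the workers a chance to start

    # Cancel only the producers; the consumers keep running.
    terminate_workers(producer_tasks, consumer_tasks, consumers=False)
    await asyncio.gather(*producer_tasks, return_exceptions=True)
    state = (
        [t.cancelled() for t in producer_tasks],
        [t.cancelled() for t in consumer_tasks],
    )

    # Clean up the remaining tasks before the event loop closes.
    terminate_workers(producer_tasks, consumer_tasks)
    await asyncio.gather(*consumer_tasks, return_exceptions=True)
    return state


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Task.cancel only requests cancellation, so the sketch awaits the tasks afterwards to let the cancellation take effect.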
- async watch(event: Any) -> None
Invoked on every CrawlerEvents.Registry.UPDATED event to determine whether the crawler is done. If it is, watch emits CrawlerEvents.Crawler.TERMINATE to trigger a shutdown on the parent Crawler.
- Parameters:
event (Any) – The event triggering invocation of watch.
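The watch/shutdown hand-off described above can be sketched with a tiny event bus. MiniBus, the string event names, and the max_links threshold are hypothetical stand-ins for EventBus, CrawlerEvents, and the terminate callable; the project's real interfaces may differ.

```python
import asyncio
from collections import defaultdict
from typing import Any, Awaitable, Callable, DefaultDict, List, Set

Handler = Callable[[Any], Awaitable[None]]


class MiniBus:
    """Hypothetical stand-in for the EventBus; not the project's real class."""

    def __init__(self) -> None:
        self._handlers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def on(self, name: str, handler: Handler) -> None:
        self._handlers[name].append(handler)

    async def emit(self, name: str, payload: Any = None) -> None:
        for handler in list(self._handlers[name]):
            await handler(payload)


async def demo() -> List[str]:
    bus = MiniBus()
    registry: Set[str] = set()
    log: List[str] = []
    max_links = 2  # assumed termination threshold

    async def watch(event: Any) -> None:
        # Runs on every registry update; emits TERMINATE once the crawl is done.
        if len(registry) >= max_links:
            await bus.emit("TERMINATE", event)

    async def shutdown(_: Any) -> None:
        # The event payload is accepted but unused, as in the documented method.
        log.append("shutdown")

    bus.on("UPDATED", watch)
    bus.on("TERMINATE", shutdown)

    for url in ("a", "b"):
        registry.add(url)
        log.append(f"added {url}")
        await bus.emit("UPDATED", url)
    return log


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Keeping the "is the crawl done?" check in watch and the teardown in shutdown means the termination rule can change without touching the shutdown path.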
- consumer(id: int, transformer: Transformer)
Create a Consumer instance using the provided transformer. All Consumers generated by a given Crawler will use the same transformer.
- Parameters:
transformer (Transformer) – A Transformer instance used to configure the generated Consumer.
- Returns:
An instance of Consumer configured with the provided transformer.
- Return type:
Consumer
- property consumer_tasks
Returns a list of Consumer tasks.
- async crawl() -> None
This method initiates the crawling process. It generates a series of Producers and Consumers, seeds the input queue, initiates the Producer/Consumer Tasks, waits for the Producers/Consumers to finish or for the Crawler to initiate shutdown, waits for Queue.join to complete, and then cancels any hanging Consumers.
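The sequence of steps above can be sketched with plain asyncio primitives. This is an illustration under assumptions, not the Crawler's implementation: it assumes Producers read from the source queue and write to the sink queue, that Consumers read from the sink, and it uses str.upper as a placeholder for the Transformer.

```python
import asyncio
from typing import List


async def producer(source: asyncio.Queue, sink: asyncio.Queue) -> None:
    while True:
        item = await source.get()
        try:
            await sink.put(item.upper())  # placeholder for the real transform
        finally:
            source.task_done()


async def consumer(sink: asyncio.Queue, results: List[str]) -> None:
    while True:
        item = await sink.get()
        try:
            results.append(item)
        finally:
            sink.task_done()


async def crawl(seed: str) -> List[str]:
    source: asyncio.Queue = asyncio.Queue(maxsize=10)
    sink: asyncio.Queue = asyncio.Queue(maxsize=10)
    results: List[str] = []

    # Seed the input queue, then start the worker tasks.
    source.put_nowait(seed)
    producers = [asyncio.create_task(producer(source, sink)) for _ in range(2)]
    consumers = [asyncio.create_task(consumer(sink, results)) for _ in range(2)]

    # Wait until every queued item has been marked done.
    await source.join()
    await sink.join()

    # The workers loop forever, so cancel whatever is still waiting on a queue.
    for task in producers + consumers:
        task.cancel()
    await asyncio.gather(*producers, *consumers, return_exceptions=True)
    return results


if __name__ == "__main__":
    print(asyncio.run(crawl("http://example.com")))
```

Because each producer puts its result on the sink before marking the source item done, waiting on source.join and then sink.join guarantees that all work has passed through both stages before the workers are cancelled.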
- classmethod create(base_url: str, consumer_transformer: Transformer, producer_transformer: Transformer, terminate: Callable, init: bool = True, producer_count: int = 10, consumer_count: int = 10, sink_max: int = 10, source_max: int = 10)
Factory method
The create method is used to create a new Crawler and generate its Producers/Consumers in one shot.
- Parameters:
base_url (str) – The initial ‘seed’ URL used to kick off crawling.
consumer_transformer (Transformer) – The Transformer instance used to create Consumers.
producer_transformer (Transformer) – The Transformer instance used to create Producers.
terminate (Callable) – The function used to determine when the Crawler should initiate shutdown.
init (bool) – Boolean indicating whether to generate Producers/Consumers upon creation.
producer_count (int) – Number of Producers to configure.
consumer_count (int) – Number of Consumers to configure.
sink_max (int) – Maximum number of items that can be placed in the underlying sink queue.
source_max (int) – Maximum number of items that can be placed in the underlying source queue.
- Returns:
Crawler instance. If init == True, the returned instance will have its producers and consumers properties populated immediately.
- Return type:
Crawler
- property done: bool
Boolean representing whether the underlying registry is full.
- Returns:
True if self.registry is full, False otherwise.
- Return type:
bool
- async generate_consumers()
Generate Consumers. These are ‘placeholders’ used by crawl to initiate the actual Consumption process.
- async generate_producers()
Generate Producers. These are ‘placeholders’ used by crawl to initiate the actual Production process.
- producer(id: int, transformer: Transformer) -> Producer
Create a Producer instance using the provided transformer. All Producers generated by a given Crawler will use the same transformer.
- Parameters:
transformer (Transformer) – A Transformer instance used to configure the generated Producer.
- Returns:
An instance of Producer configured with the provided transformer.
- Return type:
Producer
- property producer_tasks
Returns a list of Producer tasks.
- seed_source(input_: str) -> None
Place an initial ‘seed’ element onto the source queue to initiate crawling.
- Parameters:
input_ (str) – Initial element to place on the source queue.
- shut_down() -> bool
Initiates a shutdown. This sets self.__shutting_down to True and begins draining the self.source and self.sink queues.
- Returns:
Returns the value of self.__shutting_down, which will always be True after invoking this method.
- Return type:
bool
- property shutting_down: bool
Boolean representing whether the Crawler is in the process of draining its queues and shutting down.
- Returns:
True if Crawler is draining and shutting down, False otherwise.
- Return type:
bool
- async update_registry(element: Any) -> Any
Adds an element to the underlying registry and emits an event indicating that the registry has been updated. The event is ‘heard’ by the Destructor, which initiates shutdown when the registry is full.
- Parameters:
element (Any) – Element added to registry.
- Returns:
Returns the element added to the underlying Registry.
- Return type:
Any
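As a rough sketch of the add-then-notify behaviour, the example below uses a hypothetical bounded Registry class and a notify callback in place of the real registry and event emission. Both are assumptions made for illustration.

```python
import asyncio
from typing import Any, Awaitable, Callable, List, Set, Tuple


class Registry:
    """Hypothetical bounded set; the project's real registry may differ."""

    def __init__(self, max_size: int) -> None:
        self.max_size = max_size
        self._items: Set[Any] = set()

    def add(self, element: Any) -> None:
        self._items.add(element)

    @property
    def full(self) -> bool:
        return len(self._items) >= self.max_size


async def update_registry(
    registry: Registry,
    element: Any,
    notify: Callable[[Any], Awaitable[None]],
) -> Any:
    # Record the element first, then tell listeners the registry changed.
    registry.add(element)
    await notify(element)
    return element


async def demo() -> Tuple[List[str], List[Tuple[str, bool]]]:
    registry = Registry(max_size=2)
    seen: List[Tuple[str, bool]] = []

    async def notify(element: Any) -> None:
        # A listener such as the Destructor could check fullness here.
        seen.append((element, registry.full))

    returned = [await update_registry(registry, url, notify) for url in ("a", "b")]
    return returned, seen


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Returning the element unchanged lets the call sit inline in a processing chain while listeners observe every update.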