Navigating Scrapy Performance Pitfalls: An Unpickleable Iterator's Tale
I write about issues I come across on GitHub, and this post covers one of those findings.
Introduction: Web scraping is a common way to extract data from the web, and Scrapy is one of the most popular frameworks for the task. Like all software, it has its quirks and challenges. In this blog post, we'll dive into a performance pitfall caused by an unpickleable iterator and discuss the change that got around it.
The Challenge: Imagine a Scrapy spider whose task is to generate company registration numbers (CRNs) and make requests with them. If 1000 requests in a row fail, the generator instance behind them should exit early. To support this, the generator class was changed from an iterable to an iterator. The change made logical sense, but it left the class instances unpickleable. Scrapy can only write a request to its disk queue if the request can be pickled, so every request carrying one of these instances fell back to the in-memory queue instead, and performance degraded drastically over time.
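To make the pickling problem concrete, here is a minimal sketch. It assumes the iterator class wraps a live generator object, which the original issue may or may not have done; `CrnIterator` and `is_picklable` are hypothetical names invented for this illustration.

```python
import pickle


class CrnIterator:
    """Hypothetical CRN source that is its own iterator and wraps a generator."""

    def __init__(self, start, stop):
        # A live generator object is stored on the instance;
        # generator objects cannot be pickled.
        self._gen = (f"{n:08d}" for n in range(start, stop))

    def __iter__(self):
        return self

    def __next__(self):
        return next(self._gen)


def is_picklable(obj):
    """Return True if pickle can serialize obj, False otherwise."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, TypeError, AttributeError):
        return False
    return True


source = CrnIterator(1, 1000)
print(is_picklable({"crn": "00000001", "generator": source}))  # iterator attached
print(is_picklable({"crn": "00000001", "generator_n": 0}))     # index only
```

The first check prints False and the second prints True: the same payload becomes serializable once the iterator is replaced by a plain integer.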
The Symptoms:
A gradual reduction in request speed from an initial 8k requests per minute to a mere 2k.
The spider had to be restarted frequently to maintain performance.
No noticeable increase in RAM usage, but the shift of requests from disk to RAM indicated potential bottlenecks.
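Digging Deeper: Scrapy's disk queues serialize each request with pickle. When a request carries an object that cannot be pickled, such as the iterator described above, the scheduler cannot write it to disk and keeps it in the memory queue instead. Every pending request then lives in the process as a full Python object, which can increase memory consumption and may introduce CPU overhead, both of which are performance killers.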
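The fallback can be pictured with a toy model. This is not Scrapy's scheduler code, only a simplified sketch of the "try disk, fall back to memory" idea; `TinyScheduler` and its attributes are hypothetical.

```python
import pickle


class TinyScheduler:
    """Toy model of a disk-first queue with a memory fallback (not Scrapy's code)."""

    def __init__(self):
        self.disk = []    # stands in for the on-disk queue: stores pickled bytes
        self.memory = []  # stands in for the in-memory queue: stores live objects
        self.unserializable = 0

    def enqueue(self, request):
        try:
            self.disk.append(pickle.dumps(request))
        except (pickle.PicklingError, TypeError, AttributeError):
            # Serialization failed, so the request stays in process memory.
            self.unserializable += 1
            self.memory.append(request)


scheduler = TinyScheduler()
scheduler.enqueue({"url": "https://example.com", "generator_n": 0})
scheduler.enqueue({"url": "https://example.com", "generator": (n for n in range(3))})
print(len(scheduler.disk), len(scheduler.memory), scheduler.unserializable)
```

This prints `1 1 1`: the request holding only an index reaches the "disk" list, while the one holding a generator is counted as unserializable and stays in memory.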
The Solution: To tackle this challenge, we devised a solution: make all requests serializable. Instead of passing the iterator itself, pass the index of the iterator. Here's the modified method from the spider to illustrate this approach:
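def start(self, /):
    # Flatten all generators into (index, generator, crn) triples.
    it = ((n, gen, crn) for n, gen in enumerate(self.generators) for crn in gen)
    for n, generator, crn in it:
        if crn is None:
            # Nothing to request right now: schedule a placeholder request
            # whose callback and errback are both self.__start, then stop.
            yield Request('data:,', self.__start, dont_filter=True, errback=self.__start)
            self.logger.info(
                f'Generator {generator} returned None; '
                'generation will proceed in 60 seconds'
            )
            return
        self.latest_crn = crn
        yield Request(
            'https://example.com',
            self.check_crn_status,
            dont_filter=True,
            errback=self.check_crn_status_failed,
            # Pass only picklable values: the generator's index, not the generator.
            cb_kwargs=dict(crn=crn, generator_n=n),
        )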
By making this change, the issue was resolved, and the spider maintained a consistent performance throughout its run.
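On the receiving side, a callback can turn the index back into the generator it refers to. The sketch below is an assumption about how that lookup and the consecutive-failure counter could fit together; `CrnGenerator`, `SpiderState`, and `max_failures` are hypothetical, and the handlers take plain keyword arguments rather than Scrapy's response or failure objects (in Scrapy itself, an errback would read these values from `failure.request.cb_kwargs`).

```python
import pickle


class CrnGenerator:
    """Hypothetical iterator that stops after too many consecutive failures."""

    def __init__(self, start, max_failures=1000):
        self.next_crn = start
        self.max_failures = max_failures
        self.consecutive_failures = 0

    def __iter__(self):
        return self

    def __next__(self):
        if self.consecutive_failures >= self.max_failures:
            raise StopIteration
        crn = f"{self.next_crn:08d}"
        self.next_crn += 1
        return crn


class SpiderState:
    """Stand-in for the spider: owns the generators; handlers get only an index."""

    def __init__(self, generators):
        self.generators = generators

    def check_crn_status(self, crn, generator_n):
        # A success resets the failure streak of the generator that issued the CRN.
        self.generators[generator_n].consecutive_failures = 0

    def check_crn_status_failed(self, crn, generator_n):
        # A failure extends the streak; the generator stops once it hits the limit.
        self.generators[generator_n].consecutive_failures += 1


state = SpiderState([CrnGenerator(1, max_failures=3)])
issued = []
for crn in state.generators[0]:
    cb_kwargs = dict(crn=crn, generator_n=0)
    pickle.dumps(cb_kwargs)  # plain str/int values serialize without error
    state.check_crn_status_failed(**cb_kwargs)
    issued.append(crn)
print(issued)
```

With every request failing and a limit of 3, the loop issues `['00000001', '00000002', '00000003']` and then stops. The generator keeps its state on the spider, and only the integer index ever travels with the request.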
Takeaways & Recommendations:
Monitoring is Key: Always monitor both RAM and CPU usage. Bottlenecks can manifest in unexpected places.
Pickleability Matters: Scrapy can only store a request in the disk queue if the request can be pickled. Keep everything attached to a request serializable, and pass identifiers such as an index instead of live objects.
Stay Updated: Regularly update Scrapy and related packages. Bug fixes and performance improvements are frequent in open-source tools.
Log Wisely: Extensive logging can slow down an application. Adjust logging levels as per needs and consider asynchronous logging.
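For the monitoring and logging points, a few Scrapy settings make serialization problems visible early. `JOBDIR`, `SCHEDULER_DEBUG`, and `LOG_LEVEL` are standard Scrapy settings; the values below are placeholders, so treat this as a starting sketch rather than a recommended configuration.

```python
# settings.py (sketch; the values are placeholders, not taken from the original issue)

# Persist the request queue on disk for this job.
JOBDIR = "crawls/crn-spider-1"

# Log when a request cannot be serialized to disk.
SCHEDULER_DEBUG = True

# Keep routine logging moderate so it does not slow the crawl.
LOG_LEVEL = "INFO"
```

With these in place, the crawl stats are also worth a look: in the Scrapy versions we are familiar with, the scheduler reports counters such as `scheduler/enqueued/disk`, `scheduler/enqueued/memory`, and `scheduler/unserializable`, and a growing memory count alongside a configured `JOBDIR` points to unpickleable requests.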
Conclusion: Performance issues can crop up in unexpected places, even when using well-established frameworks like Scrapy. However, with careful analysis and a bit of Pythonic ingenuity, most challenges can be overcome. Always remember to keep an eye on your system's metrics and be ready to adapt and evolve your code.
The original issue is available here: https://github.com/scrapy/scrapy/issues/6119