Mocy is a simple web crawling framework that is flexible and easy to use.
## Features

- Concurrent downloads
- Decorators like `@before_download`, `@after_download`, `@pipe`
- Rate limiting and a retry mechanism
- Session keeping
- More...
## Installation

```bash
$ pip install mocy
```

## Examples

The following is a simple spider that extracts upcoming Python events.
```python
from mocy import Spider, Request, pipe


class SimpleSpider(Spider):
    entry = 'https://www.python.org/'

    def parse(self, res):
        for link in res.select('.event-widget li a'):
            yield Request(
                link['href'],
                state={'name': link.text},
                callback=self.parse_detail_page
            )

    def parse_detail_page(self, res):
        date = ' '.join(res.select('.single-event-date')[0].stripped_strings)
        yield res.state['name'], date

    @pipe
    def output(self, item):
        print(f'{item[0]} will be held on "{item[1]}"')


SimpleSpider().start()
```

The result is:

```
[2021-06-08 00:59:22] INFO : Spider is running...
[2021-06-08 00:59:23] INFO : "GET https://www.python.org/" 200 0.27s
[2021-06-08 00:59:23] INFO : "GET https://www.python.org/events/python-events/1094/" 200 0.61s
[2021-06-08 00:59:23] INFO : "GET https://www.python.org/events/python-events/964/" 200 0.63s
[2021-06-08 00:59:23] INFO : "GET https://www.python.org/events/python-events/1036/" 200 0.69s
[2021-06-08 00:59:23] INFO : "GET https://www.python.org/events/python-events/1085/" 200 0.69s
[2021-06-08 00:59:24] INFO : "GET https://www.python.org/events/python-events/833/" 200 0.79s
[2021-06-08 00:59:24] INFO : Spider exited; total running time 1.12s.
PyFest will be held on "From 16 June through 18 June, 2021"
EuroPython 2021 will be held on "From 26 July through 01 Aug., 2021"
PyCon Namibia 2021 will be held on "From 18 June through 19 June, 2021"
PyOhio 2021 will be held on "31 July, 2021"
[2021-06-08 00:59:24] INFO : SciPy 2021 will be held on "From 12 July through 18 July, 2021"
```
There are some detailed examples in the `/examples` directory.

## API

### Request
```python
class Request(url: str,
              method: str = 'GET',
              callback: Optional[Callable] = None,
              session: Union[bool, dict] = False,
              state: Optional[dict] = None,
              headers: Optional[dict] = None,
              cookies: Optional[dict] = None,
              params: Optional[dict] = None,
              data: Optional[dict] = None,
              json: Optional[dict] = None,
              files: Optional[dict] = None,
              proxies: Optional[dict] = None,
              verify: bool = True,
              timeout: Optional[Union[Tuple[Number, Number], Number]] = None,
              **kwargs)
```

The popular HTTP library requests is used under the hood; most of these parameters are passed straight through to it. Please refer to its documentation: `requests.request`.
It accepts some extra parameters:

Parameters:

- callback: Used to handle the response to this request. The default value is `self.parse`.
- session: Provides cookie persistence, connection pooling, and configuration. It can be a `bool` or a `dict`. The default value is `False`, which means no new `requests.Session` will be created.
  - If set to `True`, a `requests.Session` object is created, and subsequent requests will be sent within the same session.
  - If set to a `dict`, in addition to the above, its value is used to provide default data for the `Request`. For example: `session={'auth': ('user', 'pass'), 'headers': {'x-test': 'true'}}`.
- state: Shared between a request and the corresponding response. See the sketch below.
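As an illustration of `session` and `state`, here is a minimal sketch; the login endpoint, credentials, and header below are hypothetical placeholders, not part of the library:

```python
from mocy import Spider, Request


class LoginSpider(Spider):
    # Hypothetical login endpoint and credentials, for illustration only.
    entry = Request(
        'https://example.com/login',
        method='POST',
        data={'user': 'alice', 'password': 'secret'},
        # A dict both enables session reuse (cookies, connection pool)
        # and supplies defaults sent with every request in the session.
        session={'headers': {'x-demo': 'true'}},
    )

    def parse(self, res):
        for page in range(1, 4):
            # `state` travels with the request and reappears on the response.
            yield Request(f'https://example.com/items?page={page}',
                          state={'page': page},
                          callback=self.parse_page)

    def parse_page(self, res):
        yield res.state['page'], len(res.text)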
### Response

This object contains a server's response to an HTTP request. It is actually the same object as `requests.Response`, with several attributes and methods attached:

Attributes:

- req: The `Request` object.
- state: The same object that was passed by the corresponding `Request`.

Methods:

- `select(self, selector: str, **kw) -> List[bs4.element.Tag]`

  Performs a CSS selection on the HTML document. The powerful HTML parser Beautiful Soup is used under the hood.
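For instance, a short sketch of `select` inside a parse method; the selector targets python.org's markup as in the example above, and each match is a regular Beautiful Soup `Tag`:

```python
from mocy import Spider


class EventLinkSpider(Spider):
    entry = 'https://www.python.org/'

    def parse(self, res):
        # Each match is a bs4.element.Tag: attributes via subscripting,
        # text via get_text() or stripped_strings.
        for tag in res.select('.event-widget li a'):
            yield tag['href'], tag.get_text(strip=True)
```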
### Spider

Base class for spiders. All spiders must inherit from this class.
Class attributes:

- `WORKERS`

  Default: `os.cpu_count() * 2`

  The number of concurrent requests that will be performed by the downloader.

- `TIMEOUT`

  Default: `30`

  The amount of time (in seconds) that the downloader will wait before timing out.

- `DOWNLOAD_DELAY`

  Default: `0`

  The amount of time (in seconds) that the downloader should wait between downloads.

- `RANDOM_DOWNLOAD_DELAY`

  Default: `True`

  If enabled, the downloader will wait a random amount of time (0.5 * delay to 1.5 * delay by default) before downloading the next page.

- `RETRY_TIMES`

  Default: `3`

  Maximum number of times to retry when encountering connection issues or unexpected status codes.

- `RETRY_CODES`

  Default: `(500, 502, 503, 504, 408, 429)`

  HTTP response status codes to retry. Other errors (DNS or connection issues) are always retried. 500: Internal Server Error, 502: Bad Gateway, 503: Service Unavailable, 504: Gateway Timeout, 408: Request Timeout, 429: Too Many Requests.

- `RETRY_DELAY`

  Default: `1`

  The amount of time (in seconds) that the downloader will wait before retrying a failed request.

- `DEFAULT_HEADERS`

  Default: `{'User-Agent': 'mocy/0.1'}`
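A minimal sketch of tuning these knobs by overriding them in a subclass; the values and URL below are placeholders, not recommendations:

```python
from mocy import Spider


class PoliteSpider(Spider):
    # Placeholder settings: throttle to 4 concurrent downloads with a
    # fixed 2-second delay, retry more patiently, and identify the crawler.
    WORKERS = 4
    DOWNLOAD_DELAY = 2
    RANDOM_DOWNLOAD_DELAY = False
    RETRY_TIMES = 5
    RETRY_DELAY = 3
    DEFAULT_HEADERS = {'User-Agent': 'my-crawler/0.1'}

    entry = 'https://example.com/'

    def parse(self, res):
        yield res.url
```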
Attributes:

- `entry: Union[str, Request, Iterable[Union[str, Request]], Callable] = []`

  The entry point(s) of the spider: a URL, a `Request`, an iterable of either, or a callable. It can also be defined as a method (see below).

Methods:

- `entry() -> Union[str, Request, Iterable[Union[str, Request]]]`

  An alternative way to define the entry point(s).

- `on_start(self) -> None`

  Called when the spider starts up.

- `on_finish(self) -> None`

  Called when the spider exits.

- `on_error(self, reason: SpiderError) -> None`

  Called when the spider encounters an error while downloading or parsing. It may be called multiple times.

- `parse(self, res: Response) -> Any`

  Parses a response and generates some data or new requests.

- `collect(self, item: Any) -> Any`

  Called when the spider outputs a result. Usually it is called multiple times.

- `collect(self, item: Any, res: Response) -> Any`

  Same as above, but also receives the `Response` that produced the item.

- `start(self) -> None`

  Starts up the spider. It keeps running until all requests have been processed.
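A minimal sketch of the lifecycle hooks; the URL is a placeholder and the handlers just print:

```python
from mocy import Spider, SpiderError


class HookedSpider(Spider):
    entry = 'https://example.com/'  # placeholder

    def on_start(self):
        print('spider starting up')

    def on_error(self, reason: SpiderError):
        # May fire several times; the spider keeps running afterwards.
        print(f'error: {reason.msg}')

    def collect(self, item):
        # Default pipe: receives every yielded item.
        print('collected:', item)

    def on_finish(self):
        print('spider exited')


HookedSpider().start()
```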
### Decorators

These decorators can be applied to multiple methods of a `Spider` subclass. The decorated methods are called in the same order as they were defined.

- `before_download`

  The decorated method is used to modify request objects. If it does not return the same or a new `Request` object, the request is ignored.

- `after_download`

  The decorated method is used to modify response objects. If it does not return the same or a new `Response` object, the response is ignored. If it returns a `Request` object, that request is added to the request queue.

- `pipe`

  The decorated method is used to process yielded items. If it returns `None`, the item is not passed to the next pipe. `Spider.collect` is the default pipe.
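A sketch using all three decorators together, assuming `before_download` and `after_download` are importable from the top-level package like `pipe` in the example above; the header name and filtering rules are arbitrary:

```python
from mocy import Spider, Request, before_download, after_download, pipe


class DecoratedSpider(Spider):
    entry = 'https://example.com/'  # placeholder

    @before_download
    def add_header(self, req: Request):
        # Modify the outgoing request; returning it keeps it alive.
        # Assumes headers is exposed as an attribute on Request.
        req.headers = {**(req.headers or {}), 'x-trace': 'demo'}
        return req

    @after_download
    def drop_empty(self, res):
        # Returning None ignores the response; returning a Request
        # would add it back to the request queue instead.
        if not res.text:
            return None
        return res

    @pipe
    def only_tuples(self, item):
        # Items mapped to None stop here and skip later pipes.
        return item if isinstance(item, tuple) else None

    def parse(self, res):
        yield res.url, len(res.text)
```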
### Exceptions

- `class SpiderError(msg: str, cause: Optional[Exception] = None)`

  Base class for spider errors. The following exceptions inherit from this class.

  Attributes:

  - msg: A brief text that explains what happened.
  - cause: The underlying exception that raised this error.
  - req: The `Request` object.
  - res: The `Response` object; it may be `None`.

- `class RequestIgnored(url: str, cause: Optional[Exception] = None)`

  Indicates that a decision was made not to process a request.

- `class ResponseIgnored(url: str, cause: Optional[Exception] = None)`

  Indicates that a decision was made not to process a response.

- `class DownLoadError(url: str, cause: Optional[Exception] = None)`

  Indicates an error while downloading.

- `class ParseError(url: str, cause: Optional[Exception] = None)`

  Indicates an error while parsing.
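A sketch of telling these errors apart in `on_error`, assuming the exception classes are importable from the top-level package; the handling shown is arbitrary:

```python
from mocy import Spider, SpiderError, DownLoadError, ParseError


class CarefulSpider(Spider):
    entry = 'https://example.com/'  # placeholder

    def on_error(self, reason: SpiderError):
        if isinstance(reason, DownLoadError):
            print(f'download failed: {reason.msg} ({reason.cause!r})')
        elif isinstance(reason, ParseError):
            # Assumes the attached Request exposes a .url attribute.
            print(f'bad page from {reason.req.url}: {reason.msg}')
        else:
            print(f'spider error: {reason.msg}')
```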
## Tests

```bash
$ pytest
```

## License

MIT