Creating a faster crawler

João Júnior

PyMalta - December 5, 2018

Sequential

Parallelism

Concurrency

API

import time
from flask import Flask

app = Flask(__name__)
TIME_FASTER, TIME_SLOWLY = 1, 10

@app.route("/faster")
def faster():
    time.sleep(TIME_FASTER)
    return "Faster!"

@app.route("/slowly")
def slowly():
    time.sleep(TIME_SLOWLY)
    return "Slowly!"

@app.route("/text")
def text():
    return "Text!"
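
Every crawler below imports URL_FASTER and URL_SLOWLY from a constants module that the slides do not show. A minimal sketch, assuming the Flask app above runs locally on Flask's default port 5000:

# constants.py -- hypothetical module assumed by the crawler examples;
# adjust BASE_URL to wherever the Flask app is actually served.
BASE_URL = "http://localhost:5000"
URL_FASTER = BASE_URL + "/faster"   # responds after ~1 second
URL_SLOWLY = BASE_URL + "/slowly"   # responds after ~10 seconds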

Sequential Crawler

import requests
from constants import URL_FASTER, URL_SLOWLY

def crawler(url):
    response = requests.get(url)
    return response.status_code

if __name__ == '__main__':
    # One slow request (~10 s) followed by 20 fast ones (~1 s each),
    # issued one after another, so the waits add up.
    crawler(URL_SLOWLY)
    for i in range(20):
        crawler(URL_FASTER)
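
To compare the variants, the relevant number is wall-clock time for the whole batch of 21 requests. A minimal sketch of how a run might be timed (the timed helper and sequential_crawl wrapper are hypothetical, not from the slides):

# Hypothetical timing harness; assumes the Flask API and constants.py above.
import time

import requests
from constants import URL_FASTER, URL_SLOWLY

def crawler(url):
    response = requests.get(url)
    return response.status_code

def timed(label, fn):
    # Run fn() once and print its wall-clock duration.
    start = time.perf_counter()
    fn()
    print(f"{label}: {time.perf_counter() - start:.2f} s")

def sequential_crawl():
    crawler(URL_SLOWLY)
    for i in range(20):
        crawler(URL_FASTER)

if __name__ == '__main__':
    # Against the API above this should take roughly 10 s + 20 x 1 s = ~30 s.
    timed("sequential", sequential_crawl)

That sequential figure is the natural baseline for the concurrent variants that follow.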
          

Concurrent Crawler - Threads

from threading import Thread
import requests
from constants import URL_FASTER, URL_SLOWLY

def crawler(url):
    response = requests.get(url)
    return response.status_code

# One thread per request; the GIL is released while each thread waits
# on the network, so the 21 downloads overlap.
threads = [Thread(target=crawler, args=(URL_SLOWLY,))]
for i in range(20):
    threads.append(Thread(target=crawler, args=(URL_FASTER,)))
for t in threads:
    t.start()
for t in threads:
    t.join()
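
The same fan-out can be written with the standard library's concurrent.futures, which avoids managing Thread objects by hand and collects return values for free. A minimal sketch under the same assumptions (crawler and constants as above):

# Hypothetical alternative to the hand-rolled Thread list, using a thread pool.
from concurrent.futures import ThreadPoolExecutor

import requests
from constants import URL_FASTER, URL_SLOWLY

def crawler(url):
    response = requests.get(url)
    return response.status_code

urls = [URL_SLOWLY] + [URL_FASTER] * 20
with ThreadPoolExecutor(max_workers=len(urls)) as executor:
    # map() runs crawler on each URL across the pool and yields the
    # status codes in the order the URLs were submitted.
    status_codes = list(executor.map(crawler, urls))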
          

Concurrent Crawler - Green Threads

import gevent
import gevent.monkey
# Patch the socket module before requests is imported so that its
# blocking socket calls become cooperative and yield to the gevent hub.
gevent.monkey.patch_socket()

import requests
from gevent import Greenlet
from constants import URL_FASTER, URL_SLOWLY

def crawler(url):
    response = requests.get(url)
    return response.status_code

gthreads = [Greenlet(crawler, URL_SLOWLY)]
for i in range(20):
    gthreads.append(Greenlet(crawler, URL_FASTER))
for gthread in gthreads:
    gthread.start()
gevent.joinall(gthreads)
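
If you prefer not to manage Greenlet objects directly, gevent.spawn() creates and starts a greenlet in one call, and gevent.pool.Pool caps how many run at once. A minimal sketch (the pool size of 10 is an arbitrary assumption):

# Hypothetical variant of the Greenlet example using spawn() and a Pool.
from gevent import monkey
monkey.patch_socket()  # patch before importing requests

import gevent
from gevent.pool import Pool
import requests
from constants import URL_FASTER, URL_SLOWLY

def crawler(url):
    response = requests.get(url)
    return response.status_code

pool = Pool(10)  # at most 10 requests in flight at a time
greenlets = [pool.spawn(crawler, URL_SLOWLY)]
for i in range(20):
    greenlets.append(pool.spawn(crawler, URL_FASTER))
gevent.joinall(greenlets)
status_codes = [g.value for g in greenlets]  # value holds each return value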
          

Concurrent Crawler - AsyncIO

import asyncio
import aiohttp
from constants import URL_FASTER, URL_SLOWLY

async def crawler(url):
    async with aiohttp.ClientSession() as session:
        # Use the response as a context manager so the connection is released.
        async with session.get(url) as response:
            return response.status

loop = asyncio.get_event_loop()
futures = [asyncio.ensure_future(crawler(URL_SLOWLY))]
for i in range(20):
    futures.append(asyncio.ensure_future(crawler(URL_FASTER)))
# All 21 coroutines run on one event loop; awaiting the slow response
# simply lets the fast ones proceed in the meantime.
loop.run_until_complete(asyncio.gather(*futures))
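
Opening a new ClientSession per request works, but it throws away aiohttp's connection pooling. A variant that shares one session across all 21 requests is sketched below (asyncio.run assumes Python 3.7+; not from the slides):

# Hypothetical variant sharing a single ClientSession across requests.
import asyncio
import aiohttp
from constants import URL_FASTER, URL_SLOWLY

async def crawler(session, url):
    async with session.get(url) as response:
        return response.status

async def main():
    async with aiohttp.ClientSession() as session:
        tasks = [crawler(session, URL_SLOWLY)]
        tasks += [crawler(session, URL_FASTER) for i in range(20)]
        return await asyncio.gather(*tasks)

status_codes = asyncio.run(main())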
          

"Parallel" Crawler - Multiprocessing

from multiprocessing import Pool
import requests
from constants import URL_FASTER, URL_SLOWLY

def crawler(url):
    response = requests.get(url)
    return response.status_code

if __name__ == '__main__':
    # The guard is required on platforms that spawn worker processes.
    urls = [URL_SLOWLY]
    for i in range(20):
        urls.append(URL_FASTER)
    # One worker process per URL; map() blocks until every result is back.
    with Pool(21) as p:
        status_codes = p.map(crawler, urls)
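
For an I/O-bound crawl, starting one OS process per URL is heavyweight. The standard library's multiprocessing.dummy exposes the same Pool API backed by threads, which makes the two approaches easy to compare directly. A minimal sketch (not from the slides):

# Hypothetical thread-backed counterpart of the process Pool above.
from multiprocessing.dummy import Pool as ThreadBackedPool

import requests
from constants import URL_FASTER, URL_SLOWLY

def crawler(url):
    response = requests.get(url)
    return response.status_code

if __name__ == '__main__':
    urls = [URL_SLOWLY] + [URL_FASTER] * 20
    with ThreadBackedPool(21) as pool:
        status_codes = pool.map(crawler, urls)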
          
Benchmark #1 - Graphs #1-#5 (results charts)
Benchmark #2 - Graphs #1-#5 (results charts)
Benchmark #3 - Graphs #1-#5 (results charts)