How to create a basic web crawler with Scrapy

Programs that read information from web pages, known as web crawlers, have all sorts of useful applications.

Writing these web crawler programs is easier than you might think. Python has an excellent library for writing scripts that extract information from web pages. Let’s see how to create a web crawler with Scrapy in the following article!

Use Scrapy to create web crawlers

  • Install Scrapy
  • How to create a web crawler

    • Turn off logging
    • Use Chrome Inspector
    • Extract title
    • Find description
  • Collect JSON data
  • Extract multiple elements

Install Scrapy

Scrapy is a Python library created for scraping web pages and building web crawlers. It is fast, simple, and can navigate multiple pages without much effort.

Scrapy is available through pip, the Python package installer (PIP). For how to install pip, please refer to the article: Installing Python Package with PIP on Windows, Mac and Linux.

Using a Python virtual environment is preferred because it lets you install Scrapy in an isolated directory and leaves the rest of your system packages untouched. The Scrapy documentation recommends doing this for best results.

Create a directory for the project and initialize a virtual environment.

mkdir crawler
cd crawler
virtualenv venv
. venv/bin/activate
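
If virtualenv is not installed on your machine, the venv module that ships with Python 3 can be used instead (an equivalent alternative, not part of the steps above):

python3 -m venv venv
. venv/bin/activate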

Now you can install Scrapy into that directory using pip.

pip install scrapy

Quick check to make sure Scrapy is properly installed:

scrapy
# prints
Scrapy 1.4.0 - no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  runspider     Run a self-contained spider (without creating a project)
  ...

How to create a web crawler

Now that the environment is ready, you can start creating web crawlers. Take a look at some information from the Wikipedia page about batteries:

https://en.wikipedia.org/wiki/Battery_(electricity)

The first step in writing a crawler is to define a Python class that extends scrapy.Spider. This gives you access to all the functions in Scrapy. Call this class spider1.

The spider class needs some information:

  • A name (name) to identify the spider
  • A start_urls variable containing the list of URLs to crawl (the Wikipedia URL will be the example in this tutorial)
  • A parse() method, used to process the web page and extract information

import scrapy

class spider1(scrapy.Spider):
    name = 'Wikipedia'
    start_urls = ['https://en.wikipedia.org/wiki/Battery_(electricity)']

    def parse(self, response):
        pass

A quick test to make sure everything is running properly.

scrapy runspider spider1.py
# prints
2017-11-23 09:09:21 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: scrapybot)
2017-11-23 09:09:21 [scrapy.utils.log] INFO: Overridden settings: {'SPIDER_LOADER_WARN_ONLY': True}
2017-11-23 09:09:21 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats',
...

Turn off logging

Running Scrapy with this class prints log information that will not help at the moment. Keep things simple by removing this extra log output. Add the following code to the beginning of the file.

import logging
logging.getLogger('scrapy').setLevel(logging.WARNING)

Now when you run the script again, the log information will not be printed.
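
Alternatively, if you prefer not to modify the script, Scrapy's command-line runner has a --nolog option that suppresses logging for a single run:

scrapy runspider spider1.py --nolog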

Use Chrome Inspector

Everything on a web page is stored in HTML elements. Elements are organized in the Document Object Model (DOM). Understanding the DOM is important to make the most of the web crawler. Web crawlers search through all the HTML elements on a page to find information, so it’s important to understand how they are organized.

Google Chrome has tools that help you find HTML elements faster. With the inspector, you can locate the HTML for any element you see on a web page.

  • Navigate to a page in Chrome
  • Place your mouse over the element you want to see
  • Right click and select Inspect from the menu

These steps will open the developer console with the tab Elements selected. At the bottom of the console, you will see a tree diagram containing the elements. This tree diagram is how you will get information for the script.
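
For example, inspecting the page title on the Wikipedia article shows an h1 element whose id is firstHeading. A minimal sketch of how that markup maps to a Scrapy selector (the HTML fragment here is simplified for illustration):

from scrapy import Selector

# A simplified stand-in for the markup the inspector shows for the page title
fragment = '<h1 id="firstHeading">Battery (electricity)</h1>'
print(Selector(text=fragment).css('h1#firstHeading::text').extract())
# ['Battery (electricity)']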

Extract title

Now put the script to work by gathering some simple information: the page’s title text.

Start by adding some code to the parse() method to extract the title.

...
def parse(self, response):
    print(response.css('h1#firstHeading::text').extract())
...

The response argument supports a css() method that selects elements from the page using the selector you provide.

In this example, the element is h1#firstHeading. Appending ::text tells the script that you only want the text content of the element. Finally, the extract() method returns the selected element.

Run this script in Scrapy to print the title as text.

[u'Battery (electricity)']
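
Putting the pieces together, the whole file at this point looks roughly like this sketch (the file name spider1.py matches the earlier run command):

import logging
import scrapy

# Silence Scrapy's INFO-level log output
logging.getLogger('scrapy').setLevel(logging.WARNING)

class spider1(scrapy.Spider):
    name = 'Wikipedia'
    start_urls = ['https://en.wikipedia.org/wiki/Battery_(electricity)']

    def parse(self, response):
        # Print the title text from the h1#firstHeading element
        print(response.css('h1#firstHeading::text').extract())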

Find description

We have now extracted the title text. Let’s do a bit more with the script: the crawler will find the first paragraph after the title and extract it.

Inspecting the element tree in the Chrome Developer Console gives this path:

div#mw-content-text>div>p

The right angle bracket (>) denotes a parent-child relationship between elements. This selector matches every p element under that path, not just the description. To get only the first p element, you can write this code:

response.css('div#mw-content-text>div>p')[0]

As with the title, you add ::text to get the text content of the element.

response.css('div#mw-content-text>div>p')[0].css('::text')

The final expression uses extract() to return a list. You can use Python’s join() function to merge the list items once the crawl is complete.

def parse(self, response):
    print(''.join(response.css('div#mw-content-text>div>p')[0].css('::text').extract()))

The result is the first paragraph of the text!

An electric battery is a device consisting of one or more electrochemical cells with external connections provided to power electrical devices such as flashlights, smartphones, and electric cars.[1] When a battery is supplying electric power, its positive terminal is
 ...
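
As a side note, newer Scrapy releases also provide .get() and .getall() as more readable equivalents of extract_first() and extract(); if your installed version supports them, the same line can be written as follows:

print(''.join(response.css('div#mw-content-text>div>p')[0].css('::text').getall()))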

Collect JSON data

Extracting information as plain text is useful, but Scrapy can also output JavaScript Object Notation (JSON) data. JSON is a neat way to organize information and is widely used in web development, and it works nicely with Python.

When you want to collect data in JSON format, you can use the yield statement built into Scrapy.

Here is a new version of the script that uses yield. Instead of taking only the first p element as text, it retrieves all the p elements and arranges them in JSON format.

...
def parse(self, response):
    for e in response.css('div#mw-content-text>div>p'):
        yield { 'para' : ''.join(e.css('::text').extract()).strip() }
...

Now you can run the spider, specifying an output JSON file:

scrapy runspider spider3.py -o joe.json

The script now outputs all of the p elements.

[
 {"para": "An electric battery is a device consisting of one or more electrochemical cells with external connections provided to power electrical devices such as flashlights, smartphones, and electric cars.[1] When a battery is supplying electric power, its positive terminal is the cathode and its negative terminal is the anode.[2] The terminal marked negative is the source of electrons that when connected to an external circuit will flow and deliver energy to an external device. When a battery is connected to an external circuit, electrolytes are able to move as ions within, allowing the chemical reactions to be completed at the separate terminals and so deliver energy to the external circuit. It is the movement of those ions within the battery which allows current to flow out of the battery to perform work.[3] Historically the term "battery" specifically referred to a device composed of multiple cells, however the usage has evolved additionally to include devices composed of a single cell.[4]"},
 {"para": "Primary (single-use or "disposable") batteries are used once and discarded; the electrode materials are irreversibly changed during discharge. Common examples are the alkaline battery used for flashlights and a multitude of portable electronic devices. Secondary (rechargeable) batteries can be discharged and recharged multiple
 ...
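
Because the output is plain JSON, it is easy to load back into Python for further processing. A quick sketch using the standard library, assuming the joe.json file created above:

import json

# The feed export writes a JSON array of {"para": ...} objects
with open('joe.json') as f:
    paragraphs = json.load(f)

print(len(paragraphs))           # number of paragraphs collected
print(paragraphs[0]['para'])     # text of the first paragraph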

Extract multiple elements

So far, the web crawler has extracted the title and one kind of element from the page. Scrapy can also extract information from several different kinds of elements in one script.

Next, extract the weekend’s top box office hits from IMDb. This information comes from http://www.imdb.com/chart/boxoffice, a table with a row for each movie and columns for each metric.

The parse() method can extract multiple fields from each row. Using Chrome Developer Tools, you can find the elements nested inside the table.

...
def parse(self, response):
    for e in response.css('div#boxoffice>table>tbody>tr'):
        yield {
            'title': ''.join(e.css('td.titleColumn>a::text').extract()).strip(),
            'weekend': ''.join(e.css('td.ratingColumn')[0].css('::text').extract()).strip(),
            'gross': ''.join(e.css('td.ratingColumn')[1].css('span.secondaryInfo::text').extract()).strip(),
            'weeks': ''.join(e.css('td.weeksColumn::text').extract()).strip(),
            'image': e.css('td.posterColumn img::attr(src)').extract_first(),
        }
...

The image selector specifies that img is a descendant of td.posterColumn. To extract the src attribute, use the expression ::attr(src).

Running the spider returns JSON:

[
 {"gross": "$93.8M", "weeks": "1", "weekend": "$93.8M", "image": "https://images-na.ssl-images-amazon.com/images/M/MV5BYWVhZjZkYTItOGIwYS00NmRkLWJlYjctMWM0ZjFmMDU4ZjEzXkEyXkFqcGdeQXVyMTMxODk2OTU@._V1_UY67_CR0,0,45,67_AL_.jpg", "title": "Justice League"},
 {"gross": "$27.5M", "weeks": "1", "weekend": "$27.5M", "image": "https://images-na.ssl-images-amazon.com/images/M/MV5BYjFhOWY0OTgtNDkzMC00YWJkLTk1NGEtYWUxNjhmMmQ5ZjYyXkEyXkFqcGdeQXVyMjMxOTE0ODA@._V1_UX45_CR0,0,45,67_AL_.jpg", "title": "Wonder"},
 {"gross": "$247.3M", "weeks": "3", "weekend": "$21.7M", "image": "https://images-na.ssl-images-amazon.com/images/M/MV5BMjMyNDkzMzI1OF5BMl5BanBnXkFtZTgwODcxODg5MjI@._V1_UY67_CR0,0,45,67_AL_.jpg", "title": "Thor: Ragnarok"},
 ...
 ]
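
For reference, a complete version of this last spider might look like the sketch below (the file and class names spider4.py / spider4 are just illustrative). Scrapy’s feed exports can also write the same data as CSV by changing the output extension, for example scrapy runspider spider4.py -o movies.csv:

import logging
import scrapy

logging.getLogger('scrapy').setLevel(logging.WARNING)

class spider4(scrapy.Spider):
    name = 'BoxOffice'
    start_urls = ['http://www.imdb.com/chart/boxoffice']

    def parse(self, response):
        # One table row per film; pull several fields out of each row
        for e in response.css('div#boxoffice>table>tbody>tr'):
            yield {
                'title': ''.join(e.css('td.titleColumn>a::text').extract()).strip(),
                'weekend': ''.join(e.css('td.ratingColumn')[0].css('::text').extract()).strip(),
                'gross': ''.join(e.css('td.ratingColumn')[1].css('span.secondaryInfo::text').extract()).strip(),
                'weeks': ''.join(e.css('td.weeksColumn::text').extract()).strip(),
                'image': e.css('td.posterColumn img::attr(src)').extract_first(),
            }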
