Using Firefox for scraping

Here is a list of tips and advice on using Firefox for scraping, along with a list of useful Firefox add-ons to ease the scraping process.

Caveats with inspecting the live browser DOM

Since Firefox add-ons operate on a live browser DOM, what you actually see when inspecting the page source is not the original HTML, but a modified version produced after the browser has applied some cleanup and executed JavaScript code. Firefox, in particular, is known for adding <tbody> elements to tables. Scrapy, on the other hand, does not modify the original page HTML, so you won't be able to extract any data if you use <tbody> in your XPath expressions and the original HTML does not contain it.
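The <tbody> mismatch can be reproduced with a small sketch. It uses Python's standard-library xml.etree.ElementTree as a stand-in for Scrapy's selectors, and the table markup and the id "prices" are made up for illustration; the point is only that an XPath containing <tbody> matches nothing when the HTML that was actually sent has no such element.

```python
import xml.etree.ElementTree as ET

# Hypothetical raw HTML as a server might send it: the table has no <tbody>.
RAW_HTML = """
<html><body>
<table id="prices">
<tr><td>apple</td><td>1.20</td></tr>
<tr><td>pear</td><td>0.80</td></tr>
</table>
</body></html>
""".strip()

root = ET.fromstring(RAW_HTML)

# Path as it would look when copied from the browser's DOM view,
# which shows a <tbody> that the browser inserted.
with_tbody = root.findall(".//table[@id='prices']/tbody/tr")

# The same query without <tbody> matches the markup that was actually sent.
without_tbody = root.findall(".//table[@id='prices']/tr")

print(len(with_tbody), len(without_tbody))  # prints "0 2"
```
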

Therefore, you should keep in mind the following things when working with Firefox and XPath:

Disable JavaScript in Firefox while inspecting the DOM for XPaths to be used in Scrapy.
Never use full XPath paths; use relative and clever ones based on attributes (such as id, class, width, etc.) or on any identifying features like contains(@href, 'image').
Never include <tbody> elements in your XPath expressions unless you really know what you're doing.
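As a rough illustration of the second tip, the sketch below contrasts a position-based path with an attribute-based one. It again uses the standard-library xml.etree.ElementTree rather than Scrapy itself, and the page markup and the id "gallery" are hypothetical. ElementTree has no contains() function, so that part is emulated with a plain Python substring test.

```python
import xml.etree.ElementTree as ET

# Hypothetical page used only for this illustration.
PAGE = """
<html><body>
<div class="header"><a href="/home">Home</a></div>
<div id="gallery">
<a href="/image/1.png">First</a>
<a href="/about">About</a>
<a href="/image/2.png">Second</a>
</div>
</body></html>
""".strip()

root = ET.fromstring(PAGE)

# Brittle: a full, position-based path stops working as soon as
# another <div> is added before the gallery.
brittle = root.findall("./body/div[2]/a")

# More robust: anchor on an identifying attribute instead of a position.
gallery_links = root.findall(".//div[@id='gallery']/a")

# Emulates contains(@href, 'image') with a substring test in Python.
image_links = [
    a.get("href") for a in gallery_links if "image" in a.get("href", "")
]

print(image_links)  # prints "['/image/1.png', '/image/2.png']"
```

With a full XPath 1.0 engine, the attribute lookup and the contains() test could be written as a single expression such as //div[@id="gallery"]/a[contains(@href, "image")]/@href.
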

Useful Firefox add-ons for scraping

Firebug

Firebug is a widely known tool among web developers, and it's also very useful for scraping. In particular, its Inspect Element feature comes in very handy when you need to construct the XPaths for extracting data, because it lets you view the HTML code of each page element while moving your mouse over it.

See Using Firebug for scraping for a detailed guide on how to use Firebug with Scrapy.

XPather

XPather allows you to test XPath expressions directly on the pages you are viewing.

XPath Checker

XPath Checker is another Firefox add-on for testing XPaths on your pages.

Tamper Data

Tamper Data is a Firefox add-on that allows you to view and modify the HTTP request headers sent by Firefox. Firebug also lets you view HTTP headers, but not modify them.
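Once you know which request headers the browser sends, one common next step is to send similar headers from your own code. The sketch below only builds a request object with the standard-library urllib.request and prints its headers; nothing is sent over the network. The URLs and header values are placeholders, not values taken from any real site. Scrapy's own Request class likewise accepts a headers argument.

```python
from urllib.request import Request

# Placeholder values, standing in for headers observed in the browser.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0",
    "Accept-Language": "en-US,en;q=0.5",
    "Referer": "http://example.com/listing",
}

# Building the Request does not open a connection.
request = Request("http://example.com/data", headers=BROWSER_HEADERS)

# urllib stores header names via str.capitalize(), e.g. "User-agent".
for name, value in request.header_items():
    print(f"{name}: {value}")
```
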

Firecookie

Firecookie makes it easier to view and manage cookies. You can use this extension to create a new cookie, delete existing cookies, see a list of cookies for the current site, manage cookie permissions, and a lot more.