Web Scraping and Crawling Approach. How?

Essentially there are many approaches to scraping. It depends on the client's requirements and the goal.

My favorite stack for scraping is nodejs, cheerio, phantomjs, jquery, download, request, curl and bash commands. From hands-on experience, my understanding of how each crawler and scraper works has boiled down to a few common heuristics.

The heuristic choices for scrapers would commonly be:

A) Live Crawl and Scrape, then Download

For me, this choice is very much needed when:

I) A page has a lot of live script interaction
For example, some data on the page is not loaded until the user performs a click, a focus or some other interaction event. Crawling and scraping that data directly is possible, BUT you need to figure out how the script assembles it. It is much easier to use a tool like phantomjs or casperjs to artificially invoke those events [click, focus, blur, etc.] and capture the callback data.
Also, for pages like forms with many sequentially dependent steps, we definitely need to use this approach.
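Here is a minimal CasperJS sketch of that idea. The URL and the '#load-more' / '.detail-row' selectors are placeholders, not from any real site; swap in whatever the target page actually uses.

    // casper-click.js - invoke a click, then capture the data it reveals
    var casper = require('casper').create();

    casper.start('https://example.com/listing');    // placeholder URL

    casper.then(function () {
        this.click('#load-more');                   // assumed selector for the trigger
    });

    casper.waitForSelector('.detail-row', function () {
        var rows = this.evaluate(function () {
            return Array.prototype.map.call(
                document.querySelectorAll('.detail-row'),
                function (el) { return el.innerText; }
            );
        });
        require('utils').dump(rows);                // print the captured data
    });

    casper.run();

Run it with casperjs casper-click.js; the same pattern handles focus, blur and multi-step forms by chaining more then() and waitForSelector() calls.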

II) Single-page application data-binding which loads much later

Some sites are getting more clever: they bind the data and load the core content much later using techniques like lazy-loading. When the page first loads, most data is not available until the different sections of the page fire multiple small XHR/AJAX requests triggered by events like load, ready or scroll. For example, when you first load such a page and view the HTML, the content is blank, containing only data attributes whose names are randomized on each load. Imagine reactjs plus a data scrambler.
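A rough PhantomJS sketch of catching that late-loaded data. The '/api/' URL fragment and the 5-second wait are assumptions you would tune per site after watching its network traffic.

    // phantom-spa.js - wait for the SPA's lazy XHR calls before reading the DOM
    var page = require('webpage').create();

    page.onResourceReceived = function (response) {
        // Log the mini XHR requests the page fires after the initial load.
        if (response.stage === 'end' && response.url.indexOf('/api/') !== -1) {
            console.log('XHR finished: ' + response.url);
        }
    };

    page.open('https://example.com/spa-listing', function (status) {
        // Give the data-binding framework time to finish its lazy requests.
        setTimeout(function () {
            console.log(page.evaluate(function () {
                return document.body.innerText;
            }));
            phantom.exit();
        }, 5000);
    });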

B) Download then Scrape [Preferred]

For me, this choice is very much preferred because we can easily manage parallel tasks: one scheduled cron job invokes a bot script just to download, while in parallel another bot script runs to read, parse, extract and format the data [ETL].
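For illustration, a hedged crontab sketch with two hypothetical scripts, download-bot.js and etl-bot.js (the names and paths are assumptions, not from a specific project):

    # part A: fetch pages at the top of every hour
    0 * * * *    /usr/bin/node /opt/scraper/download-bot.js >> /var/log/download-bot.log 2>&1
    # part B: parse whatever has been downloaded, every 15 minutes
    */15 * * * * /usr/bin/node /opt/scraper/etl-bot.js >> /var/log/etl-bot.log 2>&1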
This approach is usually ideal for:

I) A listing-and-details site without much dynamic data interaction
In this case the data is static and the information is plainly listed. We can download all the information pages as offline text or HTML, then use a tool like cheeriojs to parse them later. This is ideal for tracking each step of the scrape and spotting what is missing. The most common problem is an internet connection issue, so being able to read the pages offline after downloading is obviously better. The counter-argument is that too many downloaded files take up too much space; obviously, we can then write a bash or cmd script to remove the files automatically after they are processed, via a scheduled cron job.
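A small sketch of that flow using request and cheeriojs. The URL and the '.listing-item' selector are assumed; each real site needs its own selectors.

    // download.js (hypothetical) - save the page as an offline copy
    var fs = require('fs');
    var request = require('request');

    request('https://example.com/listing?page=1', function (err, res, body) {
        if (err) { return console.error(err); }
        fs.writeFileSync('page-1.html', body);
    });

    // parse.js (hypothetical) - run later, entirely offline
    var fs = require('fs');
    var cheerio = require('cheerio');

    var $ = cheerio.load(fs.readFileSync('page-1.html', 'utf8'));
    $('.listing-item').each(function () {
        console.log($(this).find('a').attr('href'), $(this).text().trim());
    });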

II) A heavy load of data
If we try to parse live data from sites to extract a lot of information, it can become obvious: if we do not set a timer, use a VPN or take other protective measures, our crawling and scraping motive becomes crystal clear to the host server or proxy checker. In those cases, your IP can get banned either temporarily or permanently. Also, in a live crawl-and-scrape we sometimes exceed request limits unknowingly, and if the connection is lost we have no copy, forcing retry attempts that expose us to a high risk of data loss and of revealing our crawl intentions.
Therefore, with this approach we can split the work into two scripts: the part A script downloads on a timer in managed, scheduled batches, while the part B script crawls and runs ETL as fast as possible over the offline downloaded files. Because the parts are modularized, the part A script can run across different computers. The offline file artifacts from part A can be assembled and aggregated, while part B can crawl as fast as it can from these collected offline files without worrying about being banned or detected. Not to mention, we keep a flag identifier to mark downloaded and processed files so everything is managed effectively. A two-part sketch follows below.
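A hedged sketch of the two parts; the file names, URL list, 5-second delay and "raw-/done-" flag convention are all illustrative assumptions.

    // partA-download.js (hypothetical) - fetch one URL at a time with a polite delay
    var fs = require('fs');
    var request = require('request');

    var urls = ['https://example.com/item/1', 'https://example.com/item/2']; // assumed list
    var i = 0;
    (function next() {
        if (i >= urls.length) { return; }
        var idx = i++;
        request(urls[idx], function (err, res, body) {
            if (!err) { fs.writeFileSync('raw-' + idx + '.html', body); }
            setTimeout(next, 5000);                  // 5s gap so the host is not hammered
        });
    })();

    // partB-etl.js (hypothetical) - parse the offline files as fast as possible,
    // renaming each one as the "processed" flag
    var fs = require('fs');
    var cheerio = require('cheerio');

    fs.readdirSync('.')
        .filter(function (f) { return /^raw-\d+\.html$/.test(f); })
        .forEach(function (f) {
            var $ = cheerio.load(fs.readFileSync(f, 'utf8'));
            console.log($('title').text());          // extract/transform step goes here
            fs.renameSync(f, f.replace('raw-', 'done-'));
        });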
