Latest lessons learnt from crawling

Lessons learnt

I just realised that there’s a quick way to work out the pattern in a set of XPaths.

In the past, I would usually eyeball the page source or the inspector and infer the pattern by hand.

Silly me!

One quick way to see the pattern is as follows:

  • Right click on the element you are interested in on the web page and click ‘Inspect’
  • Right click on the highlighted node and click ‘Copy’
  • Click ‘Copy full XPath’

Then paste it into a notepad.

Repeat this for each of the elements that you wish to crawl.

And presto, the pattern is right there in front of you!

Case in point

In the following paths only one index changes (the div right after div[11] goes from 1 to 4), so I could easily write a loop to build the XPaths, feed each one into the Python xpath function and call ‘extract_first()’ on the result!

/html/body/div[4]/div[2]/div[1]/div[1]/div/div[2]/div[11]/div[1]/div/div/div/div[1]/div/div[1]/div[3]/a
/html/body/div[4]/div[2]/div[1]/div[1]/div/div[2]/div[11]/div[2]/div/div/div/div[1]/div/div[1]/div[3]/a
/html/body/div[4]/div[2]/div[1]/div[1]/div/div[2]/div[11]/div[3]/div/div/div/div[1]/div/div[1]/div[3]/a
/html/body/div[4]/div[2]/div[1]/div[1]/div/div[2]/div[11]/div[4]/div/div/div/div[1]/div/div[1]/div[3]/a
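The loop over the paths above could look roughly like the sketch below. It assumes a Scrapy/parsel-style selector (for example Scrapy's response object) that offers .xpath(...).extract_first(). The names XPATH_TEMPLATE, build_xpaths and extract_hrefs are made up for illustration, and appending /@href to pull out the link target is my own assumption about what you want from each a element.

```python
# The copied XPath, with the one index that changes replaced by {i}.
XPATH_TEMPLATE = (
    "/html/body/div[4]/div[2]/div[1]/div[1]/div/div[2]/div[11]"
    "/div[{i}]/div/div/div/div[1]/div/div[1]/div[3]/a"
)


def build_xpaths(template, count, start=1):
    """Return `count` XPaths, filling {i} with start, start + 1, ..."""
    return [template.format(i=i) for i in range(start, start + count)]


def extract_hrefs(selector, xpaths):
    """Run each XPath on the selector and keep the first match of each.

    `selector` is assumed to behave like a Scrapy response, i.e. it has
    .xpath(query).extract_first(). Paths that match nothing are skipped.
    """
    hrefs = []
    for xp in xpaths:
        # "/@href" is an assumption: it selects the link target of the <a>.
        href = selector.xpath(xp + "/@href").extract_first()
        if href is not None:
            hrefs.append(href)
    return hrefs


# The four paths listed above, rebuilt from the template.
xpaths = build_xpaths(XPATH_TEMPLATE, count=4)
```

Inside a Scrapy spider's parse method, this would be called as extract_hrefs(response, xpaths). Full paths like these break as soon as the page layout shifts, so it is worth re-copying one of them now and then to check that the template still holds.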
