OnCrawl’s crawler is the core component of our solution, used to collect massive amounts of data across the web. It is built on top of distributed storage and computing frameworks (HDFS, Hadoop, and Spark). Each month it analyses billions of web pages, with JavaScript rendering, and hundreds of billions of links along the way.
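To give a flavour of what Spark jobs over crawl data look like, here is a minimal sketch, not OnCrawl’s actual code: it counts inbound links per URL across a crawl stored on HDFS. The Parquet path and the column names (`url`, `outlinks`) are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object LinkStats {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("crawl-link-stats")
      .getOrCreate()

    // Each row: a crawled page with its extracted outbound links.
    val pages = spark.read.parquet("hdfs:///crawl/pages")

    // Flatten page -> link pairs and count in-links per target URL,
    // with the aggregation distributed across the cluster.
    val inlinkCounts = pages
      .select(explode(col("outlinks")).as("target"))
      .groupBy("target")
      .count()

    inlinkCounts.write.parquet("hdfs:///crawl/inlink-counts")
    spark.stop()
  }
}
```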
As a member of the Crawler team, you will tackle issues such as network congestion, load balancing, bandwidth optimisation, cost reduction, and other big-data infrastructure challenges; a sketch of one such building block follows below.
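As an illustration of one bandwidth-control primitive a crawler of this scale needs (again a sketch under assumptions, not the team’s actual implementation): a per-host token bucket that caps request rate so no single site is overloaded. The class names and rate limits here are hypothetical.

```scala
import java.util.concurrent.ConcurrentHashMap

// Refills `ratePerSec` tokens per second up to `burst`; each request
// spends one token.
final class TokenBucket(ratePerSec: Double, burst: Double) {
  private var tokens = burst
  private var last = System.nanoTime()

  // Returns true if a request may proceed now, false if it must wait.
  def tryAcquire(): Boolean = synchronized {
    val now = System.nanoTime()
    tokens = math.min(burst, tokens + (now - last) / 1e9 * ratePerSec)
    last = now
    if (tokens >= 1.0) { tokens -= 1.0; true } else false
  }
}

object PerHostThrottle {
  private val buckets = new ConcurrentHashMap[String, TokenBucket]()

  // At most 2 requests/sec per host, with a burst of 4 (illustrative values).
  def mayFetch(host: String): Boolean =
    buckets.computeIfAbsent(host, _ => new TokenBucket(2.0, 4.0)).tryAcquire()
}
```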