Scraping Daraz: Writing a Python Web Scraper for Daraz.pk
While trying to understand the talking point these days (psst: ML/DS), I found myself looking at terms that would instantly make me look like that guy from The Matrix if I ever ushered them into my friend circle. My basic understanding: you've got some data, you do some computer magic on it, et voilà, Data Science!
Disclaimer: This web scraper was built for exploratory purposes. I'm just a rookie programmer. Keep those guys in suits away from me!
But what truly caught my attention was the question of where all this data comes from. It's not a data entry clerk doing their 9-to-5 routine, collecting all this stuff in their mean old registers (trust me, this is pretty plausible in my country). It must involve programmers! (Or the other guy.) I went to my savior, Google. A little bit of searching eased my tension: web scraping is how this data is collected and, thank God, programming is involved in the automation. I was amazed and spent the following days learning more about it.
After getting some hands-on experience with web scraping in Python using the Beautiful Soup library, I was hopeful I could even scrape data from Amazon. My dreams soon shattered when I found out that the requests.get() function I was using to fetch the HTML of websites doesn't execute JavaScript. Websites built with modern JavaScript frameworks therefore never serve their rendered <div> tags through requests.get(); you only receive the bare initial markup. I realized it was time to bring out the big guns.
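To see the problem concretely, here's a minimal sketch (the URL and attribute are the same ones used later in this post; since the product grid is rendered client-side, the lookup will most likely come back empty):

    import requests
    from bs4 import BeautifulSoup

    # requests fetches the raw HTML but never executes any JavaScript
    response = requests.get("https://www.daraz.pk/catalog/?q=rubiks+cube")
    soup = BeautifulSoup(response.text, 'html.parser')

    # The product grid is built by JavaScript, so it isn't in this response
    print(soup.find('div', {'data-qa-locator': 'general-products'}))  # Most likely None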
My ambitions for making an Amazon web scraper were too far-fetched, so I stuck to something smaller but similar: Daraz, the biggest e-commerce platform in Pakistan. Their website is also built with modern JavaScript, so using Beautiful Soup alone was a big no-no. The idea that struck me was to find a way to fetch the final rendered HTML of the pages and then hand it to Beautiful Soup for processing. That's where Selenium came into the picture.
Selenium is a web testing and automation tool. It can render modern webpages and read their final source. I planned to render the product page with Selenium and pass the page source to Beautiful Soup for processing. I could've done the entire thing in Selenium, but spending hours learning its API would've led to a brain melt. I knew Beautiful Soup well enough, so I stuck with it.
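The handoff itself is only a few lines. A minimal sketch, assuming Chrome and its driver are installed:

    from bs4 import BeautifulSoup
    from selenium import webdriver

    driver = webdriver.Chrome()  # Launches a Chrome window controlled by Selenium
    driver.get("https://www.daraz.pk/")  # Selenium renders the page, JavaScript included
    soup = BeautifulSoup(driver.page_source, 'html.parser')  # Pass the rendered HTML to Beautiful Soup
    driver.quit()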
So, the final plan:
- Fetch the product page HTML using Selenium.
- Load the HTML string into Beautiful Soup for processing.
- Mine product data from the soup.
- Write the data to a CSV file.
import csv

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()

query = "Rubiks Cube"  # Search term on Daraz
pages = 2  # Number of result pages to scrape

query = '+'.join(query.split())  # Transform query into a Daraz URL search term

csv_file = open(f"daraz-products-{'-'.join(query.split('+'))}.csv", "w", newline="")
writer = csv.writer(csv_file)
writer.writerow([f"Search Term: {' '.join(query.split('+'))}", 'Rating', 'Items Sold', 'Selling Price', 'List Price'])

for page in range(1, pages + 1):
    driver.get(f"https://www.daraz.pk/catalog/?page={page}&q={query}")
    soup = BeautifulSoup(driver.page_source, 'html.parser')

    # The product grid carries data-qa-locator="general-products";
    # its first child div tells us the class shared by every grid item
    grid_item_class = soup.find('div', {'data-qa-locator': 'general-products'}).find('div')['class']

    for item_div in soup.find_all('div', class_=grid_item_class):
        item_div = item_div.find_all('div')[3]  # The inner div holding the product details
        item_name = item_div.find('div', id='id-title').text
        try:
            # Ratings look like "4.2/5"; anything without a '/' is not a rating
            if '/' not in item_div.find_all('span')[1].text:
                item_ratings = 'N/A'
            else:
                item_ratings = item_div.find_all('span')[1].text
            items_sold = item_div.find_all('div')[1].find_all('div')[-1].text
        except (IndexError, AttributeError):
            item_ratings = '-'
            items_sold = '-'

        # The struck-through <del> tag holds the original list price, if any
        del_tag = item_div.find('div', id='id-price').find('del')
        if del_tag is None or del_tag.text == "":
            item_original_price = "-"
        else:
            item_original_price = del_tag.text.strip('Rs. ')
        item_list_price = item_div.find('div', id='id-price').find_all('span')[1].text

        writer.writerow([item_name, item_ratings, items_sold, item_list_price, item_original_price])

csv_file.close()
driver.quit()
This is the final code I came up with. It ain't that pretty, but it gets the job done. A quick summary of what this code does:
- Iterate over the result pages and read their page source.
- Pass the page source to Beautiful Soup.
- Beautiful Soup searches for <div> tags with a specific attribute key-value pair; in our case, the attribute "data-qa-locator" with the value "general-products" (see the toy sketch after this list).
- Iterate over the search results and extract product-related data from each.
- Write the product data into a CSV file using Python's built-in csv library.
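To see that attribute lookup in isolation, here's a toy example (the HTML is made up; only the attribute name mirrors Daraz's markup):

    from bs4 import BeautifulSoup

    html = '''
    <div data-qa-locator="general-products">
      <div class="gridItem">Product A</div>
      <div class="gridItem">Product B</div>
    </div>
    '''
    soup = BeautifulSoup(html, 'html.parser')

    # Find the grid by its attribute key-value pair, then reuse the
    # first child's class to match every item in the grid
    grid = soup.find('div', {'data-qa-locator': 'general-products'})
    item_class = grid.find('div')['class']
    for item in soup.find_all('div', class_=item_class):
        print(item.text)  # Product A, Product B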
In the end, the code works. It writes the search query results into a CSV file, which can then be read by a CSV viewer or loaded into Pandas for data analysis and exploration.
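For example, a quick peek in Pandas (assuming the file name the script produces for the default "Rubiks Cube" query):

    import pandas as pd

    # File name follows the pattern from the script above
    df = pd.read_csv('daraz-products-Rubiks-Cube.csv')
    print(df.head())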
Here's a glimpse of what I was able to do with the above code: