Scraping Daraz: Writing a Python Web Scraper for Daraz.pk
While trying to understand the talking point these days (psst: ML/DS), I found myself looking at terms that would instantly make me look like that guy from The Matrix if I ever ushered them into my friend circle. My basic understanding: you've got some data, you do some computer magic on it, et voilà, Data Science!
Disclaimer: This web scraper was built for exploratory purposes. I'm just a rookie programmer. Keep those guys in suits away from me!
But what truly caught my attention was the question of where all this data comes from. It's not a data entry clerk doing their 9-to-5 routine, collecting all this stuff in their mean old registers (trust me, this is pretty plausible in my country). It must involve programmers! (Or the other guy.) I went to my savior, Google. A little bit of searching eased my tension: web scraping is how this data is collected and, thank God, programming is involved in the automation. I was amazed and spent the following days learning more about it.
After getting some hands-on experience with web scraping in Python using the Beautiful Soup library, I was hopeful I could even scrape data from Amazon. My dreams soon shattered when I found out that the requests.get() function I was using to fetch the HTML of websites doesn't execute JavaScript. Websites built with modern JavaScript frameworks therefore never serve their rendered <div> tags through requests.get(); you only receive the bare initial markup. I realized it was time to bring out the big guns.
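To see the problem concretely, here's a minimal sketch (the URL and attribute are the same ones used later in this post; since the product grid is rendered client-side, the lookup will most likely come back empty):

    import requests
    from bs4 import BeautifulSoup

    # requests fetches the raw HTML but never executes any JavaScript
    response = requests.get("https://www.daraz.pk/catalog/?q=rubiks+cube")
    soup = BeautifulSoup(response.text, 'html.parser')

    # The product grid is built by JavaScript, so it isn't in this response
    print(soup.find('div', {'data-qa-locator': 'general-products'}))  # Most likely None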
My ambitions for making an Amazon web scraper were too far-fetched, so I stuck to something smaller but similar: Daraz, the biggest e-commerce platform in Pakistan. Their website is also built with modern JavaScript, so using Beautiful Soup alone was a big no-no. The idea that struck me was to find a way to fetch the final rendered HTML of the pages and then hand it to Beautiful Soup for processing. That's where Selenium came into the picture.
Selenium is a web testing and automation tool. It can render modern webpages and read their final source. I planned to render the product page with Selenium and pass the page source to Beautiful Soup for processing. I could've done the entire thing in Selenium, but spending hours learning its API would've led to a brain melt. I knew Beautiful Soup well enough, so I stuck with it.
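The handoff itself is only a few lines. A minimal sketch, assuming Chrome and its driver are installed:

    from bs4 import BeautifulSoup
    from selenium import webdriver

    driver = webdriver.Chrome()  # Launches a Chrome window controlled by Selenium
    driver.get("https://www.daraz.pk/")  # Selenium renders the page, JavaScript included
    soup = BeautifulSoup(driver.page_source, 'html.parser')  # Pass the rendered HTML to Beautiful Soup
    driver.quit()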
So, the final plan:
- Fetch the product page HTML using Selenium.
- Load the HTML string into Beautiful Soup for processing.
- Mine product data from the soup.
- Write the data to a CSV file.
import csv

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()

query = "Rubiks Cube"  # Search term on Daraz
pages = 2  # Number of result pages to scrape

query = '+'.join(query.split())  # Transform query into a Daraz URL search term

csv_file = open(f"daraz-products-{'-'.join(query.split('+'))}.csv", "w", newline="")
writer = csv.writer(csv_file)
writer.writerow([f"Search Term: {' '.join(query.split('+'))}", 'Rating', 'Items Sold', 'Selling Price', 'List Price'])

for page in range(1, pages + 1):
    driver.get(f"https://www.daraz.pk/catalog/?page={page}&q={query}")
    soup = BeautifulSoup(driver.page_source, 'html.parser')

    # The product grid carries data-qa-locator="general-products";
    # its first child div tells us the class shared by every grid item
    grid_item_class = soup.find('div', {'data-qa-locator': 'general-products'}).find('div')['class']

    for item_div in soup.find_all('div', class_=grid_item_class):
        item_div = item_div.find_all('div')[3]  # The inner div holding the product details
        item_name = item_div.find('div', id='id-title').text
        try:
            # Ratings look like "4.2/5"; anything without a '/' is not a rating
            if '/' not in item_div.find_all('span')[1].text:
                item_ratings = 'N/A'
            else:
                item_ratings = item_div.find_all('span')[1].text
            items_sold = item_div.find_all('div')[1].find_all('div')[-1].text
        except (IndexError, AttributeError):
            item_ratings = '-'
            items_sold = '-'

        # The struck-through <del> tag holds the original list price, if any
        del_tag = item_div.find('div', id='id-price').find('del')
        if del_tag is None or del_tag.text == "":
            item_original_price = "-"
        else:
            item_original_price = del_tag.text.strip('Rs. ')
        item_list_price = item_div.find('div', id='id-price').find_all('span')[1].text

        writer.writerow([item_name, item_ratings, items_sold, item_list_price, item_original_price])

csv_file.close()
driver.quit()
This is the final code I came up with. It ain't that pretty, but it gets the job done. A quick summary of what this code does:
- Iterate over the result pages and read their page source.
- Pass the page source to Beautiful Soup.
- Beautiful Soup searches for <div> tags with a specific attribute key-value pair; in our case, the attribute "data-qa-locator" with the value "general-products" (see the toy sketch after this list).
- Iterate over the search results and extract product-related data from each.
- Write the product data into a CSV file using Python's built-in csv library.
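To see that attribute lookup in isolation, here's a toy example (the HTML is made up; only the attribute name mirrors Daraz's markup):

    from bs4 import BeautifulSoup

    html = '''
    <div data-qa-locator="general-products">
      <div class="gridItem">Product A</div>
      <div class="gridItem">Product B</div>
    </div>
    '''
    soup = BeautifulSoup(html, 'html.parser')

    # Find the grid by its attribute key-value pair, then reuse the
    # first child's class to match every item in the grid
    grid = soup.find('div', {'data-qa-locator': 'general-products'})
    item_class = grid.find('div')['class']
    for item in soup.find_all('div', class_=item_class):
        print(item.text)  # Product A, Product B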
In the end, the code works. It writes the search query results into a CSV file, which can then be read by a CSV viewer or loaded into Pandas for data analysis and exploration.
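For example, a quick peek in Pandas (assuming the file name the script produces for the default "Rubiks Cube" query):

    import pandas as pd

    # File name follows the pattern from the script above
    df = pd.read_csv('daraz-products-Rubiks-Cube.csv')
    print(df.head())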
Here's a glimpse of what I was able to do with the above code: