Stuck in loop <> Code doesn't want to pull anything except row 1

Jonth · Oct 1, 2021

I am stuck in a loop and don't know what to change to make my code work normally. The problem seems to be with the CSV file: it contains a list of domains (freedommortgage.com, google.com, amd.com, etc.). When I run the code, everything is fine at the start, but then it keeps printing the same result over and over:

the monthly total visits to freedommortgage.com is 1.10M

So here is my code:

Code:

import csv
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import urllib
from captcha2upload import CaptchaUpload
import time


# setting the firefox driver
def init_driver():
    driver = webdriver.Firefox(executable_path=r'C:\Users\muki\Desktop\similarweb_scrapper-master\geckodriver.exe')
    driver.implicitly_wait(10)
    return driver


# solving the captcha (with 2captcha.com)
def captcha_solver(driver):
    captcha_src = driver.find_element_by_id('recaptcha_challenge_image').get_attribute("src")
    urllib.urlretrieve(captcha_src, "captcha.jpg")
    captcha = CaptchaUpload("4cfd308fd703d40291a7e250d743ca84")  # 2captcha API KEY
    captcha_answer = captcha.solve("captcha.jpg")
    wait = WebDriverWait(driver, 10)
    captcha_input_box = wait.until(
        EC.presence_of_element_located((By.ID, "recaptcha_response_field")))
    captcha_input_box.send_keys(captcha_answer)
    driver.implicitly_wait(10)
    captcha_input_box.submit()


# inputting the domain in similar web search box and finding necessary values
def lookup(driver, domain, short_method):
    # short method - inputting the domain in the url 
    if short_method:
        driver.get("https://www.similarweb.com/website/" + domain)
    else:
        driver.get("https://www.similarweb.com")
    attempt = 0
    # trying 3 times before quiting (due to second refresh by the website that clears the search box)
    while attempt < 1:
        try:
            captcha_body_page = driver.find_elements_by_class_name("block-page")
            driver.implicitly_wait(10)
            if captcha_body_page:
                print("Captcha ahead, solving the captcha, it may take a few seconds")
                captcha_solver(driver)
                print("Captcha solved! the program will continue shortly")
                time.sleep(20)  # to prevent second refresh affecting the upcoming elements finding after captcha solved
        # for normal method, inputting the domain in the searchbox instead of url
            if not short_method:
                input_element = driver.find_element_by_id("js-swSearch-input")
                input_element.click()
                input_element.send_keys(domain)
                input_element.submit()
            wait = WebDriverWait(driver, 10)
            time.sleep(10)
            total_visits = wait.until(
                EC.presence_of_element_located((By.XPATH, "//span[@class='engagementInfo-valueNumber js-countValue']")))


            total_visits_line = "the monthly total visits to %s is %s" % (domain, total_visits.text)
            time.sleep(10)
            print('\n' + total_visits_line)


        except TimeoutException:
            print("Box or Button or Element not found in similarweb while checking %s" % domain)
            attempt += 1
            print("attempt number %d... trying again" % attempt)


# main
if __name__ == "__main__":
    with open('bigdomains.csv', 'rt') as f:
        reader = csv.reader(f)
        driver = init_driver()
        for row in reader:
            domain = row[0]
            lookup(driver, domain, True) # user need to give as a parameter True or False, True will activate the
            # short method, False will take the normal method

(Sorry for the long block of code, but I have to present everything, even though the focus is on the LAST PART of the code.)
My question is simple:

Why does it keep taking the domain from row 1 and ignoring rows 2, 3, 4, etc.?

The time.sleep delay has to be 10 seconds or more to avoid the captcha issue on this website.

If anyone wants to try running this, you have to change the name of the CSV file and put a few domains in it, in the format google.com (not www.google.com).

Alanic · Oct 1, 2021

Looks like you're accessing the same index every time with:

domain = row[0]

Index 0 is the first item, which is why you keep getting the same value.
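Another thing that may be worth checking is the `while attempt < 1` loop inside `lookup`: `attempt` is only incremented in the `except` branch, so after a successful lookup the loop condition stays true and the same domain is fetched again. A minimal sketch of a bounded retry that returns on success is below. The `fetch` callable and `fake_fetch` are hypothetical stand-ins for the Selenium steps, and the built-in `TimeoutError` stands in for Selenium's `TimeoutException`:

```python
def lookup_with_retry(fetch, domain, max_attempts=3):
    """Try fetch(domain) up to max_attempts times; return on first success.

    fetch is a hypothetical callable standing in for the Selenium steps.
    TimeoutError stands in for selenium's TimeoutException.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            result = fetch(domain)
        except TimeoutError:
            print("attempt number %d failed for %s" % (attempt, domain))
        else:
            return result  # success: leave the loop instead of repeating
    return None  # every attempt timed out


def fake_fetch(domain):
    # Stand-in that always succeeds; the real code would read the page.
    return "visits for " + domain


for domain in ["freedommortgage.com", "google.com", "amd.com"]:
    print(lookup_with_retry(fake_fetch, domain))
```

In the original `lookup`, a similar effect could probably be had by adding a `break` (or `return`) right after the result is printed.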

This post explains an alternative way to use a for loop in Python.

Accessing the index in 'for' loops?
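The linked post covers `enumerate`. A minimal sketch of how that might look with `csv.reader`, using an in-memory string as a stand-in for `bigdomains.csv` (assumed here to hold one domain per line); the names `sample` and `domains` are illustrative only. Note that `row[0]` is the first field of the current row, so it changes on each pass through the loop:

```python
import csv
import io

# In-memory stand-in for bigdomains.csv, one domain per line (assumed layout).
sample = io.StringIO("freedommortgage.com\ngoogle.com\namd.com\n")

domains = []
for index, row in enumerate(csv.reader(sample), start=1):
    if not row:
        continue  # skip blank lines, which csv.reader yields as empty lists
    domains.append(row[0])  # first column of the current row
    print("row %d: %s" % (index, row[0]))
```

Printing the row number next to each domain like this is a quick way to confirm whether the CSV loop itself is advancing.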
