Here the crawler is triggered through an API call, so I started by writing the serializer:
```python
from rest_framework import serializers


class IgCommentsSerializer(serializers.Serializer):
    post = serializers.CharField(max_length=1000)
    poster = serializers.CharField(max_length=200)
```
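As a quick sanity check (not part of the original view), the serializer can be exercised on its own with a hand-made list of dicts whose keys match the two fields above; the sample data here is made up purely for illustration:

```python
# Illustrative only: feed the serializer a fake list of comments
# (the 'alice'/'bob' sample data is made up, not scraped).
sample = [
    {'poster': 'alice', 'post': 'nice photo!'},
    {'poster': 'bob', 'post': 'where was this taken?'},
]
ser = IgCommentsSerializer(sample, many=True)
print(ser.data)  # a list of {'post': ..., 'poster': ...} mappings, ready for Response()
```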
First comes the basic Selenium webdriver setup. To scrape all the comments we have to log in to IG first, then navigate to the target post page:
```python
import time

from rest_framework.views import APIView
from rest_framework.response import Response
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import NoSuchElementException


# NOTE: this uses the Selenium 3-style find_element_by_* API
class IgComments(APIView):
    def __init__(self):
        self.path = 'path to chromedriver'
        self.sbaccount = 'your IG account'
        self.sbpd = 'your IG password'

    def post(self, request):
        options = Options()
        options.add_argument("--headless")  # run the browser in the background only
        driver = webdriver.Chrome(self.path, options=options)
        driver.implicitly_wait(3)
        driver.get('https://www.instagram.com/')
        time.sleep(2)
        account = driver.find_elements_by_name('username')[0]
        pd = driver.find_elements_by_name('password')[0]
        # log in
        account.send_keys(self.sbaccount)
        pd.send_keys(self.sbpd)
        driver.find_element_by_xpath('//*[@id="loginForm"]/div/div[3]/button').click()
        time.sleep(3)
        driver.get('https://www.instagram.com/p/CYXqAMuBX0e/')  # jump straight to the target post
        more_xpath = '//*[@id="react-root"]/section/main/div/div[1]/article/div/div[2]/div/div[2]/div[1]/ul/li/div/button/div'
        time.sleep(2)
```
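One possible refinement (my own suggestion, not part of the original code): the fixed `time.sleep()` calls can be flaky on slow connections, so an explicit wait on the login form is a common alternative. A minimal sketch in the same Selenium 3 style, meant to replace the sleep/lookup lines inside `post()`:

```python
# Sketch: explicit waits instead of fixed sleeps (my own variant, not the original flow)
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(driver, 10)
# block until the username field exists, instead of sleeping a fixed 2 seconds
account = wait.until(EC.presence_of_element_located((By.NAME, 'username')))
pd = driver.find_element_by_name('password')
account.send_keys(self.sbaccount)
pd.send_keys(self.sbpd)
# click the login button only once it is actually clickable
wait.until(EC.element_to_be_clickable(
    (By.XPATH, '//*[@id="loginForm"]/div/div[3]/button'))).click()
```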
IG only loads 12 comments at a time (if I remember correctly XD), so we need Selenium to automatically click the "load more comments" button. To grab every comment, a while loop keeps clicking until there is no button left to click, after which all the comment text can be scraped in one go:
```python
        # ...continuing from the code block above...
        while True:
            try:
                time.sleep(2)
                driver.find_element_by_xpath(more_xpath).click()
                print('next page')
            except NoSuchElementException:
                # no "load more" button left, so we've reached the last page
                print('last page')
                break

        crawl_comments = []
        comments = driver.find_element_by_class_name("XQXOT").find_elements_by_class_name("Mr508")
        n = 1
        for c in comments:
            poster = c.find_element_by_css_selector('h3._6lAjh span').text
            post_xpath = f'//*[@id="react-root"]/section/main/div/div[1]/article/div/div[2]/div/div[2]/div[1]/ul/ul[{n}]/div/li/div/div/div[2]/span'
            time.sleep(2)
            post = c.find_element_by_xpath(post_xpath).text
            crawl_comments.append({'poster': poster, 'post': post})
            n += 1

        driver.quit()  # close the headless browser before returning
        ser = IgCommentsSerializer(crawl_comments, many=True)
        return Response(ser.data)
```
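For completeness, the view still has to be wired into Django's URL configuration before it can be called. A minimal sketch, assuming the view lives in views.py; the `'ig-comments/'` route name is my own choice, not from the original post:

```python
# urls.py — minimal routing sketch (the 'ig-comments/' path and views.py module are assumptions)
from django.urls import path

from .views import IgComments

urlpatterns = [
    path('ig-comments/', IgComments.as_view()),
]
```

With that in place, a POST to /ig-comments/ (e.g. via curl or Postman) triggers the whole login–click–scrape flow and returns the serialized comments as JSON.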