**近期参加了一个项目,主要是负责数据部分,现在这个时候也没我什么事情了,写一下项目中用到的爬虫总结。**  # **0.总结** ## **1)pymysql连接数据库** ```python config = { "host": "xxx.com", "port": 0, "user": "xxx", "password": "xxx", "database": "xxx" } ``` ```python db = pymysql.connect(**config) cursor = db.cursor() #获取游标 sql = "xxx" #要执行的sql语句 cursor.execute(sql) #执行sql db.commit() # 提交数据 cursor.close() db.close()#关闭游标和数据库 ```   单次执行一条sql语句效率不如批量执行多条,也就是db.commit()可以在for循环最后执行。 使用数据库后应关闭游标和数据库 ## **2)request.get()在长时间运行后报错** 可以运用get中的参数timeout= 和try来进行再次尝试,一般一到两次即可成功,网上查阅了很多资料,目前只有这一种管用。 ```python i = 0 while i < 5: try: raw = requests.get(link, headers=header, proxies=proxies, timeout=2).text break except requests.exceptions.RequestException: i = i + 1 print('请求超时,正在重试...2') ``` ## **3)代理ip** 爬酷推时遇到了限制ip访问的情况,可以选用代理ip。 [https://www.xiaoxiangdaili.com/](https://www.xiaoxiangdaili.com/)** 短效ip 一块钱一天,还挺合算 ```python def changeip(): ip = requests.get('xxx').text while ip=='{"code":1010,"success":false,"data":null,"msg":"请求过于频繁"}': time.sleep(5) ip = requests.get('xxx').text print("更换ip成功" + ip) proxies = { 'http': ip, 'https': ip } return proxies ``` ``` ``` **在request.get()函数中只需要带上proxies=proxies即可** raw = requests.get(link, headers=header, proxies=proxies, timeout=2) ## **4)保存数据到数据库时遇到的代码报错** **原因是数据库用的字符格式是utf8,而emoji表情大多需要utf8mb4格式,其中包括utf8,遇到该类错误首先更改数据库字符格式,其次将emoji替换,可以用正则也可以用emoji包,最后再无法解决可以用replace替换** 附上从网上找到的正则替换代码 ```python def filter_emoji(desstr, restr=''): # 过滤表情 try: co = re.compile(u'[\U00010000-\U0010ffff]') except re.error: co = re.compile(u'[\uD800-\uDBFF][\uDC00-\uDFFF]') return co.sub(restr, desstr) ``` ## **5)beautifulsoup** request到的数据先用bs4解析 ```python soup = BeautifulSoup(raw, "html.parser") ``` soup.select真是个好东西,下面是他的选择方法  ## **6)正则** **主要是用到re.compile()和re.findall()以及re.sub()** **re.compile():定义正则规则** **re.findall():查找 参数为正则规则,数据,返回值为列表** **re.sub():替换 参数为正则规则,要替换成的数据,数据** # **1.简书** **简书的爬虫是我这三个爬虫中最好写的一个,没有任何反爬,用F12确定元素定位后,可以用正则,可以用bs4解析即可获得内容。**  **不过写简书的翻页请求数据开始花了点功夫,请求的页数数据要写在post请求数据包里,用一个for循环即可解决,还有一个难点就是一开始数据出现大量重复,查阅资料后发现需要在request请求头里更新cookie数据,否则数据会出现重复,降低效率。** **ajax异步加载需要用F12抓包获取请求地址** **数据整理替换去重时使用正则替换re.sub()和text.replace()** ```python # -*- codeing = utf-8 -*- # @Time : 2021/8/3 16:17 # @Author : yunqi # @Fire : jianshu.py # @Software : PyCharm import os import requests from bs4 import BeautifulSoup import re import time import xlwt import pymysql global header,config,cookie header = { 'Connection': 'keep-alive', 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36', 'X-PJAX': 'true' # 'sensorsdata2015jssdkcross=%7B%22distinct_id%22%3A%2217b3466849938b-08f7999168164d-4343363-2073600-17b3466849ace1%22%2C%22first_id%22%3A%22%22%2C%22props%22%3A%7B%7D%7D; sajssdk_2015_cross_new_user=1; Hm_lvt_0c0e9d9b1e7d617b3e6842e85b9fb068=1628420549,1628422073,1628423751,1628671739; locale=zh-CN; _ga=GA1.2.1168608088.1628671739; _gid=GA1.2.896180568.1628671739; UM_distinctid=17b346685e9346-0002a4ae6d88e7-4343363-1fa400-17b346685eae44; _m7e_session_core=f67ec2211a28225502a4d2b81632bcc0; signin_redirect=https%3A%2F%2Fwww.jianshu.com%2F; read_mode=day; default_font=font2; Hm_lpvt_0c0e9d9b1e7d617b3e6842e85b9fb068=1628682890; CNZZDATA1279807957=288995723-1628668245-%7C1628679045; sensorsdata2015jssdkcross=%7B%22distinct_id%22%3A%2217b3466849938b-08f7999168164d-4343363-2073600-17b3466849ace1%22%2C%22first_id%22%3A%22%22%2C%22props%22%3A%7B%22%24latest_traffic_source_type%22%3A%22%E7%9B%B4%E6%8E%A5%E6%B5%81%E9%87%8F%22%2C%22%24latest_search_keyword%22%3A%22%E6%9C%AA%E5%8F%96%E5%88%B0%E5%80%BC_%E7%9B%B4%E6%8E%A5%E6%89%93%E5%BC%80%22%2C%22%24latest_referrer%22%3A%22%22%7D%2C%22%24device_id%22%3A%2217b3510ac62318-05af7b13d1ca09-4343363-2073600-17b3510ac636e3%22%7D' } config = { } global workbook, worksheet global id, postdata postdata='' def main(): global cookie,header response = requests.get("https://www.jianshu.com/", headers=header) cookie = requests.utils.dict_from_cookiejar(response.cookies) global workbook, worksheet, id #startexcel() id=0 #=====================正则==== auth=re.compile(r'(.*?)') textre=re.compile(r'') img1=re.compile(r'\n\n\n\n\n.*?\n') timere=re.compile(r'data":{"is_author":.*?,"last_updated_at":(.*?),"public_title"') passre=re.compile('<.*?>') #===================== for num in range(1,15): #简书首页最多获取15页内容 links,seenid,jianjies=GetUrl(num) for v,i in enumerate(links): global config db = pymysql.connect(**config) cursor = db.cursor() sql = "SELECT nid FROM article WHERE nid=" + seenid[v] # 查询文章绝对id是否存在 msg = cursor.execute(sql) cursor.close() db.close() if msg==1: print("该条数据已经存在") continue try: i = "https://www.jianshu.com" + i id += 1 raw = requests.get(i, headers=header).text #获取文章详情界面的内容 soup = BeautifulSoup(raw, "html.parser") title=soup.select('h1') #获取文章标题 auther=re.findall(auth,str(soup))[0] #获取作者 time1=int(re.findall(timere,str(soup))[0]) timeArray = time.localtime(time1) Time = time.strftime("%Y-%m-%d %H:%M:%S", timeArray) #从时间戳转换到发布日期 nid=seenid[v] jianjie=jianjies[v] except IndexError: print('获取数据失败 跳过 可能是文章正在审核') print(i) id-=1 continue #----------对内容进行简单的处理 text=str(soup.select('article')[0]) text=re.sub(textre,"",text) text=re.sub(img1,"[https:",text) text = re.sub(img2, "]", text) text=text.replace("","").replace("","\n").replace("","\n").replace("","\n").replace("","\n").replace("",'\n').replace("",'\n') #内容替换 text = re.sub(passre, " ", text) data=(title[0].text,auther,jianjie,text,Time,i,nid) #SaveDataxls(data) print(data) SaveMyql(data) os.system('python Article_Keywords.py') # 保存 #workbook.save('data.xls') #time.sleep(30) # 休息三十秒 def GetUrl(num): #从简书首页获取文章链接列表 #======================================正则 seenidre=re.compile(r'') getlink=re.compile(r'.*?')#获取文章links的正则 jianjiere=re.compile(r'\n (.*?)\n ') #===================================== global header,cookie global postdata print("page=" + str(num) + postdata) raw = requests.post("https://www.jianshu.com/trending_notes",data="page="+str(num)+postdata,headers=header,cookies=cookie).text #对page进行更改可以获取不同页数 urls=re.findall(getlink,raw) seenid=re.findall(seenidre,raw) jianjie=re.findall(jianjiere,raw) for esid in seenid: postdata=postdata+'&seen_snote_ids%5B%5D='+str(esid) return (urls,seenid,jianjie) #返回七个文章链接 每次加载七个 def SaveMyql(data): global config db = pymysql.connect(**config) cursor = db.cursor() #获取游标 sql = "INSERT INTO article(title,author,summary,content,create_time,link,nid) VALUES"+str(data) cursor.execute(sql) db.commit() # 提交数据 print("success") cursor.close() db.close() def SaveDataxls(data): global workbook, worksheet dataid=int(data[0]) # 写入excel # 参数对应 行, 列, 值 worksheet.write(dataid, 0, str(dataid)) worksheet.write(dataid, 1, data[1]) worksheet.write(dataid, 2, data[2]) worksheet.write(dataid, 3, data[3]) worksheet.write(dataid, 4, data[4]) worksheet.write(dataid, 5, data[5]) worksheet.write(dataid, 6, data[6]) worksheet.write(dataid, 7, data[7]) def startexcel(): global workbook, worksheet # 创建一个workbook 设置编码 workbook = xlwt.Workbook(encoding='utf-8') # 创建一个worksheet worksheet = workbook.add_sheet('data') # 参数对应 行, 列, 值 初始化 worksheet.write(0, 0, "id") worksheet.write(0, 1, "title") worksheet.write(0, 2, "auther") worksheet.write(0, 3, "summary") worksheet.write(0, 4, "content") worksheet.write(0, 5, "update") worksheet.write(0, 6, "source") worksheet.write(0, 7, "nid") if __name__ == "__main__": main() ``` # **2.少数派** **少数派这个网站还是项目里其他组员推荐的,于是就改了改简书的爬虫把它写下来了,基本大同小异,甚至比简书少了一步post时的数据校验。** ```python # -*- codeing = utf-8 -*- # @Time : 2021/8/3 16:17 # @Author : yunqi # @Fire : ssp.py # @Software : PyCharm import os import requests from bs4 import BeautifulSoup import re import time import xlwt import pymysql import json global header,config,cookie header = { 'Connection': 'keep-alive', 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36', 'X-PJAX': 'true' # 'sensorsdata2015jssdkcross=%7B%22distinct_id%22%3A%2217b3466849938b-08f7999168164d-4343363-2073600-17b3466849ace1%22%2C%22first_id%22%3A%22%22%2C%22props%22%3A%7B%7D%7D; sajssdk_2015_cross_new_user=1; Hm_lvt_0c0e9d9b1e7d617b3e6842e85b9fb068=1628420549,1628422073,1628423751,1628671739; locale=zh-CN; _ga=GA1.2.1168608088.1628671739; _gid=GA1.2.896180568.1628671739; UM_distinctid=17b346685e9346-0002a4ae6d88e7-4343363-1fa400-17b346685eae44; _m7e_session_core=f67ec2211a28225502a4d2b81632bcc0; signin_redirect=https%3A%2F%2Fwww.jianshu.com%2F; read_mode=day; default_font=font2; Hm_lpvt_0c0e9d9b1e7d617b3e6842e85b9fb068=1628682890; CNZZDATA1279807957=288995723-1628668245-%7C1628679045; sensorsdata2015jssdkcross=%7B%22distinct_id%22%3A%2217b3466849938b-08f7999168164d-4343363-2073600-17b3466849ace1%22%2C%22first_id%22%3A%22%22%2C%22props%22%3A%7B%22%24latest_traffic_source_type%22%3A%22%E7%9B%B4%E6%8E%A5%E6%B5%81%E9%87%8F%22%2C%22%24latest_search_keyword%22%3A%22%E6%9C%AA%E5%8F%96%E5%88%B0%E5%80%BC_%E7%9B%B4%E6%8E%A5%E6%89%93%E5%BC%80%22%2C%22%24latest_referrer%22%3A%22%22%7D%2C%22%24device_id%22%3A%2217b3510ac62318-05af7b13d1ca09-4343363-2073600-17b3510ac636e3%22%7D' } global workbook, worksheet global id, postdata postdata='' def main(): global cookie,header response = requests.get("https://sspai.com/", headers=header) cookie = requests.utils.dict_from_cookiejar(response.cookies) global workbook, worksheet, id #startexcel() #=====================正则==== textre=re.compile(r'') img1=re.compile(r'", "").replace("", "\n").replace("", "\n").replace("","\n").replace("", "\n").replace("", '\n').replace("", '\n').replace("\"/>","") # 内容替换 text = re.sub(passre, " ", text) timeArray = time.localtime(creat_time[v]) Time = time.strftime("%Y-%m-%d %H:%M:%S", timeArray) data = (title[v], author[v], jianjie[v], text, Time, i, nid[v]) # SaveDataxls(data) print(data) SaveMyql(data) except IndexError: print('获取数据失败 跳过 可能是文章正在审核') print(i) time.sleep(10) continue time.sleep(1) os.system('python Article_Keywords.py') # 保存 #workbook.save('data.xls') #time.sleep(30) # 休息三十秒 def GetUrl(num): #从简书首页获取文章链接列表 #======================================正则 urls=[] jianjie=[] title=[] author=[] creat_time=[] nid=[] #===================================== global header,cookie timeArray = str(time.time()) raw=requests.get("https://sspai.com/api/v1/article/index/page/get?limit=10&offset="+str(num*10)+"&created_at="+timeArray,headers=header,cookies=cookie) jsondata=json.loads(raw.text)['data'] for i in jsondata: nid.append(i['id']) title.append(i['title']) jianjie.append(i['summary']) author.append(i['author']['nickname']) creat_time.append(i['released_time']) urls.append('https://sspai.com/post/'+str(i['id'])) return (urls,jianjie,nid,title,author,creat_time) #返回七个文章链接 每次加载七个 def SaveMyql(data): global config db = pymysql.connect(**config) cursor = db.cursor() sql = "INSERT INTO article(title,author,summary,content,create_time,link,nid) VALUES"+str(data) cursor.execute(sql) db.commit() # 提交数据 print("success") cursor.close() db.close() def SaveDataxls(data): global workbook, worksheet dataid=int(data[0]) # 写入excel # 参数对应 行, 列, 值 worksheet.write(dataid, 0, str(dataid)) worksheet.write(dataid, 1, data[1]) worksheet.write(dataid, 2, data[2]) worksheet.write(dataid, 3, data[3]) worksheet.write(dataid, 4, data[4]) worksheet.write(dataid, 5, data[5]) worksheet.write(dataid, 6, data[6]) worksheet.write(dataid, 7, data[7]) def startexcel(): global workbook, worksheet # 创建一个workbook 设置编码 workbook = xlwt.Workbook(encoding='utf-8') # 创建一个worksheet worksheet = workbook.add_sheet('data') # 参数对应 行, 列, 值 初始化 worksheet.write(0, 0, "id") worksheet.write(0, 1, "title") worksheet.write(0, 2, "auther") worksheet.write(0, 3, "summary") worksheet.write(0, 4, "content") worksheet.write(0, 5, "update") worksheet.write(0, 6, "source") worksheet.write(0, 7, "nid") if __name__ == "__main__": main() ``` # **3.酷推** **酷推是专门整合各类信息的网站,获取更多数据竟然还要充值季度会员,怒花10块大洋。** **爬了一些文章后,提示ip被ban,只能上代理了,这里用的是小象代理。** [https://www.xiaoxiangdaili.com/](https://www.xiaoxiangdaili.com/)短效ip 一块钱一天,还挺合算 ```python # coding = utf8mb4 import os import requests from bs4 import BeautifulSoup import time from retry import retry import multiprocessing import re global config import pymysql import text_summarizer import emoji config = { } def main(): wordlinks=[] #wordlinks.append('https://www.tuicool.com/ah/101050000/') #wordlinks.append('https://www.tuicool.com/ah/101040000/') # wordlinks.append('https://www.tuicool.com/ah/101000000/') # wordlinks.append('https://www.tuicool.com/ah/0/') # wordlinks.append('https://www.tuicool.com/topics/10050043?st=1&lang=1&pn=') # wordlinks.append('https://www.tuicool.com/topics/10050042?st=1&lang=1&pn=') # wordlinks.append('https://www.tuicool.com/topics/10050828?st=1&lang=1&pn=') # wordlinks.append('https://www.tuicool.com/topics/10050001?st=1&lang=1&pn=') wordlinks.append('https://www.tuicool.com/topics/11020012?st=1&lang=1&pn=') wordlinks.append('https://www.tuicool.com/topics/22120059?st=1&lang=1&pn=') wordlinks.append('https://www.tuicool.com/topics/22190243?st=1&lang=1&pn=') wordlinks.append('https://www.tuicool.com/topics/22120138?st=1&lang=1&pn=') requests.adapters.DEFAULT_RETRIES = 20 for w,wordlink in enumerate(wordlinks): for page in range(0, 2): proxies = changeip() header = { "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36", "Connection": "close", "cookie": "UM_distinctid=17c275c006ece3-083b66b719969d-a7d173c-384000-17c275c006fe78; __gads=ID=96df49f668e51a53-22c50fc1eecb00b5:T=1632748471:RT=1632748471:S=ALNI_MZ6tw2ZepbAUa-I3s2IIpLh2yOvCA; CNZZDATA5541078=cnzz_eid%3D620993454-1632739373-https%253A%252F%252Fwww.baidu.com%252F%26ntime%3D1632756015; _tuicool_session=OW4rL2MvODBaeFhBV1lHN1VqVWVVdjRYaUN6cXJZa3BGeGFCQTRybWRWZnVsS1IrZXlYb2E0NlpjMnNzUUo0NEhGTU94N1F2VS9FTjFVQzkySUxDMFFDL2FlRittN1VORGlxUmV6Wk9MTjZ5eG82ZVlpak9yVlozZm1RM3l0Y3pKU3hWM2RiTC9VVFRMUS8zbWZDb0lFZjhWay9RcGxBZVp4TFJlNURTQ3VwVHFDSzl6cm9NVGJMdlBrcTRIRWsvZEozY0cxL1B3SUNHN0Z5cEg5L240ZTRJaEpWNjl3aS9yU2Zoc1Q1U0Y4dk4vajM5YjNZbmtrSlg1MUN3V1hCMmc2U3cxVTFsdVhSTFVvb3doZFNhYmNJWDZ0MitMTW1ENExseG4vZHd4eVArWktpYkJCdWZtMzZHL1Jxa1kxdS8wOFNNRi93N2xHZFF1SVJUN0pOUWEwRkpTRHFKZUxudm9Xb3dEYXdSUStVU1l0K3hKQkRhWUZMU0d3YjQ1UExNRmphTkhiWTlyOFA4WG91RFVMekRyVTVGdzRGUXFlRzJrY1dKa21ncGtLTC9jREFqR1J3U2d5MGFKaWhQSGcwYlk3UmtTdEtlRDRDdi84QXpaN0RBYlF1elNpK09Iazc2UFFOUWQ1TkMxWjQ9LS04T3Ntd0VoMGxPbmhmWnBBekl6ZTV3PT0%3D--da792ede16542253ae36372e2225d41a90ea6760" } i = 0 while i < 5: try: raw = requests.get(wordlink + str(page), headers=header, proxies=proxies, timeout=2).text break except requests.exceptions.RequestException: i = i + 1 print('请求超时,正在重试...1') if i == 5: continue if str(raw)=="": break soup = BeautifulSoup(raw, "html.parser") textlistdata = soup.select(".title a") textlist = [] ids = [] for i in textlistdata: textlist.append("https://www.tuicool.com" + i['href']) ids.append(i['href']) #p = multiprocessing.Process(target=spider, args=(textlist,wordname[w],ids,header,proxies)) #p.start() spider(textlist,ids,header,proxies) # time.sleep() time.sleep(5) print("第" + str(page) + "页采集成功...") os.system('python Article_Keywords.py') os.system('python classify.py') def changeip(): ip = requests.get( 'https://api.xiaoxiangdaili.com/ip/get?appKey=760113727143301120&appSecret=uRjSbv3w&cnt=&wt=text').text while ip=='{"code":1010,"success":false,"data":null,"msg":"请求过于频繁"}': time.sleep(5) ip = requests.get( 'https://api.xiaoxiangdaili.com/ip/get?appKey=760113727143301120&appSecret=uRjSbv3w&cnt=&wt=text').text print("更换ip成功" + ip) proxies = { 'http': ip, 'https': ip } return proxies def spider(textlist,ids,header,proxies): global config imgre1=re.compile('') passre = re.compile('<.*?>') db = pymysql.connect(**config,charset="utf8mb4") cursor = db.cursor() for v, link in enumerate(textlist): sql = "SELECT nid FROM article WHERE nid='" + (str(ids[v]).replace('/','')+"'") # 查询文章绝对id是否存在 msg = cursor.execute(sql) if msg == 1: print("该条数据已经存在------" ) continue i = 0 while i < 5: try: raw = requests.get(link, headers=header, proxies=proxies, timeout=2).text break except requests.exceptions.RequestException: i = i + 1 print('请求超时,正在重试...2') if i == 5: proxies = changeip() continue soup = BeautifulSoup(raw, "html.parser") try: article = str(soup.select(".article_body")[0]) article=article.replace('\xa0','').replace('xa0','').replace('','**').replace("","**").replace("","# ").replace("","").replace("","## ").replace("","").replace('ufeff','\n').replace('','\n').replace('','\n') article=re.sub(imgre1,' article = re.sub(imgre2, ')',article) article=re.sub(passre,"",article) article=filter_emoji(article) timea=soup.select('.timestamp')[0].text.replace('时间\xa0','').replace('\n','').replace(' ','') author=soup.select('.cut')[0].text.replace('\n','').replace(' ','') author=filter_emoji(author) source=soup.select('.cut')[1].text nnid=ids[v].replace("/", "") title=soup.select('h1')[0].text title=filter_emoji(title) except: continue data=(title,author,article,timea,source,nnid) print(data) sql = "INSERT INTO article(title,author,content,create_time,link,nid) VALUES" + str(data) try: cursor.execute(sql) except: continue db.commit() # 提交数据 print("success") cursor.close() db.close() #if os.path.exists(".\\" + wordname + "\\" + wordname + ids[v].replace("/", "-") + ".txt") == 0 and article != "": # f = open(".\\" + wordname + "\\" + wordname + ids[v].replace("/", "-") + ".txt", 'a', # encoding='utf-8') # f.write(article) # f.close # print(".\\" + wordname + "\\" + wordname + ids[v].replace("/", "-") + ".txt 采集成功..." + link) #else: # print("已经存在...跳过") # continue def filter_emoji(desstr, restr=''): # 过滤表情 try: co = re.compile(u'[\U00010000-\U0010ffff]') except re.error: co = re.compile(u'[\uD800-\uDBFF][\uDC00-\uDFFF]') return co.sub(restr, desstr) def SaveMyql(data): global config db = pymysql.connect(**config) cursor = db.cursor() sql = "INSERT INTO article(title,author,content,create_time,link,nid) VALUES"+str(data) cursor.execute(sql) db.commit() # 提交数据 print("success") cursor.close() db.close() # 按间距中的绿色按钮以运行脚本。 if __name__ == '__main__': main() ``` **上面很长的一个列表是获取词条单页时需要用到的,为此专门写了个小脚本,只不过没有将其整合。** ```python import requests import re import json def main(): id=[] link=[] header = { "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36", "Connection": "close", "cookie": "UM_distinctid=17c275c006ece3-083b66b719969d-a7d173c-384000-17c275c006fe78; __gads=ID=96df49f668e51a53-22c50fc1eecb00b5:T=1632748471:RT=1632748471:S=ALNI_MZ6tw2ZepbAUa-I3s2IIpLh2yOvCA; CNZZDATA5541078=cnzz_eid%3D620993454-1632739373-https%253A%252F%252Fwww.baidu.com%252F%26ntime%3D1632756015; _tuicool_session=OW4rL2MvODBaeFhBV1lHN1VqVWVVdjRYaUN6cXJZa3BGeGFCQTRybWRWZnVsS1IrZXlYb2E0NlpjMnNzUUo0NEhGTU94N1F2VS9FTjFVQzkySUxDMFFDL2FlRittN1VORGlxUmV6Wk9MTjZ5eG82ZVlpak9yVlozZm1RM3l0Y3pKU3hWM2RiTC9VVFRMUS8zbWZDb0lFZjhWay9RcGxBZVp4TFJlNURTQ3VwVHFDSzl6cm9NVGJMdlBrcTRIRWsvZEozY0cxL1B3SUNHN0Z5cEg5L240ZTRJaEpWNjl3aS9yU2Zoc1Q1U0Y4dk4vajM5YjNZbmtrSlg1MUN3V1hCMmc2U3cxVTFsdVhSTFVvb3doZFNhYmNJWDZ0MitMTW1ENExseG4vZHd4eVArWktpYkJCdWZtMzZHL1Jxa1kxdS8wOFNNRi93N2xHZFF1SVJUN0pOUWEwRkpTRHFKZUxudm9Xb3dEYXdSUStVU1l0K3hKQkRhWUZMU0d3YjQ1UExNRmphTkhiWTlyOFA4WG91RFVMekRyVTVGdzRGUXFlRzJrY1dKa21ncGtLTC9jREFqR1J3U2d5MGFKaWhQSGcwYlk3UmtTdEtlRDRDdi84QXpaN0RBYlF1elNpK09Iazc2UFFOUWQ1TkMxWjQ9LS04T3Ntd0VoMGxPbmhmWnBBekl6ZTV3PT0%3D--da792ede16542253ae36372e2225d41a90ea6760" } for page in range(2,10): raw = requests.get("https://www.tuicool.com/topics/my_hot?id="+str(page),headers=header).text jsondata=json.loads(raw)['data'] for i in jsondata: id.append(i['id']) for x in id: print("wordlinks.append('https://www.tuicool.com/topics/"+str(x)+"?st=1&lang=1&pn=')") if __name__ == '__main__': main() ``` # **4.心得与总结** **团队协作的感觉还不错(也可能是我负责的部分太简单),只需要专心做自己的事情就可以了,多多学习吧,掌握更多的技术。** Loading... **近期参加了一个项目,主要是负责数据部分,现在这个时候也没我什么事情了,写一下项目中用到的爬虫总结。**  # **0.总结** ## **1)pymysql连接数据库** ```python config = { "host": "xxx.com", "port": 0, "user": "xxx", "password": "xxx", "database": "xxx" } ``` ```python db = pymysql.connect(**config) cursor = db.cursor() #获取游标 sql = "xxx" #要执行的sql语句 cursor.execute(sql) #执行sql db.commit() # 提交数据 cursor.close() db.close()#关闭游标和数据库 ```   单次执行一条sql语句效率不如批量执行多条,也就是db.commit()可以在for循环最后执行。 使用数据库后应关闭游标和数据库 ## **2)request.get()在长时间运行后报错** 可以运用get中的参数timeout= 和try来进行再次尝试,一般一到两次即可成功,网上查阅了很多资料,目前只有这一种管用。 ```python i = 0 while i < 5: try: raw = requests.get(link, headers=header, proxies=proxies, timeout=2).text break except requests.exceptions.RequestException: i = i + 1 print('请求超时,正在重试...2') ``` ## **3)代理ip** 爬酷推时遇到了限制ip访问的情况,可以选用代理ip。 [https://www.xiaoxiangdaili.com/](https://www.xiaoxiangdaili.com/)** 短效ip 一块钱一天,还挺合算 ```python def changeip(): ip = requests.get('xxx').text while ip=='{"code":1010,"success":false,"data":null,"msg":"请求过于频繁"}': time.sleep(5) ip = requests.get('xxx').text print("更换ip成功" + ip) proxies = { 'http': ip, 'https': ip } return proxies ``` ``` ``` **在request.get()函数中只需要带上proxies=proxies即可** raw = requests.get(link, headers=header, proxies=proxies, timeout=2) ## **4)保存数据到数据库时遇到的代码报错** **原因是数据库用的字符格式是utf8,而emoji表情大多需要utf8mb4格式,其中包括utf8,遇到该类错误首先更改数据库字符格式,其次将emoji替换,可以用正则也可以用emoji包,最后再无法解决可以用replace替换** 附上从网上找到的正则替换代码 ```python def filter_emoji(desstr, restr=''): # 过滤表情 try: co = re.compile(u'[\U00010000-\U0010ffff]') except re.error: co = re.compile(u'[\uD800-\uDBFF][\uDC00-\uDFFF]') return co.sub(restr, desstr) ``` ## **5)beautifulsoup** request到的数据先用bs4解析 ```python soup = BeautifulSoup(raw, "html.parser") ``` soup.select真是个好东西,下面是他的选择方法  ## **6)正则** **主要是用到re.compile()和re.findall()以及re.sub()** **re.compile():定义正则规则** **re.findall():查找 参数为正则规则,数据,返回值为列表** **re.sub():替换 参数为正则规则,要替换成的数据,数据** # **1.简书** **简书的爬虫是我这三个爬虫中最好写的一个,没有任何反爬,用F12确定元素定位后,可以用正则,可以用bs4解析即可获得内容。**  **不过写简书的翻页请求数据开始花了点功夫,请求的页数数据要写在post请求数据包里,用一个for循环即可解决,还有一个难点就是一开始数据出现大量重复,查阅资料后发现需要在request请求头里更新cookie数据,否则数据会出现重复,降低效率。** **ajax异步加载需要用F12抓包获取请求地址** **数据整理替换去重时使用正则替换re.sub()和text.replace()** ```python # -*- codeing = utf-8 -*- # @Time : 2021/8/3 16:17 # @Author : yunqi # @Fire : jianshu.py # @Software : PyCharm import os import requests from bs4 import BeautifulSoup import re import time import xlwt import pymysql global header,config,cookie header = { 'Connection': 'keep-alive', 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36', 'X-PJAX': 'true' # 'sensorsdata2015jssdkcross=%7B%22distinct_id%22%3A%2217b3466849938b-08f7999168164d-4343363-2073600-17b3466849ace1%22%2C%22first_id%22%3A%22%22%2C%22props%22%3A%7B%7D%7D; sajssdk_2015_cross_new_user=1; Hm_lvt_0c0e9d9b1e7d617b3e6842e85b9fb068=1628420549,1628422073,1628423751,1628671739; locale=zh-CN; _ga=GA1.2.1168608088.1628671739; _gid=GA1.2.896180568.1628671739; UM_distinctid=17b346685e9346-0002a4ae6d88e7-4343363-1fa400-17b346685eae44; _m7e_session_core=f67ec2211a28225502a4d2b81632bcc0; signin_redirect=https%3A%2F%2Fwww.jianshu.com%2F; read_mode=day; default_font=font2; Hm_lpvt_0c0e9d9b1e7d617b3e6842e85b9fb068=1628682890; CNZZDATA1279807957=288995723-1628668245-%7C1628679045; sensorsdata2015jssdkcross=%7B%22distinct_id%22%3A%2217b3466849938b-08f7999168164d-4343363-2073600-17b3466849ace1%22%2C%22first_id%22%3A%22%22%2C%22props%22%3A%7B%22%24latest_traffic_source_type%22%3A%22%E7%9B%B4%E6%8E%A5%E6%B5%81%E9%87%8F%22%2C%22%24latest_search_keyword%22%3A%22%E6%9C%AA%E5%8F%96%E5%88%B0%E5%80%BC_%E7%9B%B4%E6%8E%A5%E6%89%93%E5%BC%80%22%2C%22%24latest_referrer%22%3A%22%22%7D%2C%22%24device_id%22%3A%2217b3510ac62318-05af7b13d1ca09-4343363-2073600-17b3510ac636e3%22%7D' } config = { } global workbook, worksheet global id, postdata postdata='' def main(): global cookie,header response = requests.get("https://www.jianshu.com/", headers=header) cookie = requests.utils.dict_from_cookiejar(response.cookies) global workbook, worksheet, id #startexcel() id=0 #=====================正则==== auth=re.compile(r'<a class=".*?" href=".*?"><span class=".*?">(.*?)</span></a>') textre=re.compile(r'<article class=".*?">') img1=re.compile(r'<div class="image-package">\n<div class="image-container" style=".*?">\n<div class="image-container-fill" style=".*?"></div>\n<div class="image-view" data-height=".*?" data-width=".*?"><img data-original-filesize=".*?" data-original-format="image/jpeg" data-original-height=".*?" data-original-src="') img2=re.compile(r'" data-original-width=".*?"/></div>\n</div>\n<div class="image-caption">.*?</div>\n</div>') timere=re.compile(r'data":{"is_author":.*?,"last_updated_at":(.*?),"public_title"') passre=re.compile('<.*?>') #===================== for num in range(1,15): #简书首页最多获取15页内容 links,seenid,jianjies=GetUrl(num) for v,i in enumerate(links): global config db = pymysql.connect(**config) cursor = db.cursor() sql = "SELECT nid FROM article WHERE nid=" + seenid[v] # 查询文章绝对id是否存在 msg = cursor.execute(sql) cursor.close() db.close() if msg==1: print("该条数据已经存在") continue try: i = "https://www.jianshu.com" + i id += 1 raw = requests.get(i, headers=header).text #获取文章详情界面的内容 soup = BeautifulSoup(raw, "html.parser") title=soup.select('h1') #获取文章标题 auther=re.findall(auth,str(soup))[0] #获取作者 time1=int(re.findall(timere,str(soup))[0]) timeArray = time.localtime(time1) Time = time.strftime("%Y-%m-%d %H:%M:%S", timeArray) #从时间戳转换到发布日期 nid=seenid[v] jianjie=jianjies[v] except IndexError: print('获取数据失败 跳过 可能是文章正在审核') print(i) id-=1 continue #----------对内容进行简单的处理 text=str(soup.select('article')[0]) text=re.sub(textre,"",text) text=re.sub(img1,"[https:",text) text = re.sub(img2, "]", text) text=text.replace("</article>","").replace("<p>","\n").replace("</p>","\n").replace("</br>","\n").replace("<b>","\n").replace("</b>",'\n').replace("<br/>",'\n') #内容替换 text = re.sub(passre, " ", text) data=(title[0].text,auther,jianjie,text,Time,i,nid) #SaveDataxls(data) print(data) SaveMyql(data) os.system('python Article_Keywords.py') # 保存 #workbook.save('data.xls') #time.sleep(30) # 休息三十秒 def GetUrl(num): #从简书首页获取文章链接列表 #======================================正则 seenidre=re.compile(r'<li id=".*?" data-note-id="(.*?)" class=".*?">') getlink=re.compile(r'<a class="title" target="_blank" href="(.*?)">.*?</a>')#获取文章links的正则 jianjiere=re.compile(r'<p class="abstract">\n (.*?)\n </p>') #===================================== global header,cookie global postdata print("page=" + str(num) + postdata) raw = requests.post("https://www.jianshu.com/trending_notes",data="page="+str(num)+postdata,headers=header,cookies=cookie).text #对page进行更改可以获取不同页数 urls=re.findall(getlink,raw) seenid=re.findall(seenidre,raw) jianjie=re.findall(jianjiere,raw) for esid in seenid: postdata=postdata+'&seen_snote_ids%5B%5D='+str(esid) return (urls,seenid,jianjie) #返回七个文章链接 每次加载七个 def SaveMyql(data): global config db = pymysql.connect(**config) cursor = db.cursor() #获取游标 sql = "INSERT INTO article(title,author,summary,content,create_time,link,nid) VALUES"+str(data) cursor.execute(sql) db.commit() # 提交数据 print("success") cursor.close() db.close() def SaveDataxls(data): global workbook, worksheet dataid=int(data[0]) # 写入excel # 参数对应 行, 列, 值 worksheet.write(dataid, 0, str(dataid)) worksheet.write(dataid, 1, data[1]) worksheet.write(dataid, 2, data[2]) worksheet.write(dataid, 3, data[3]) worksheet.write(dataid, 4, data[4]) worksheet.write(dataid, 5, data[5]) worksheet.write(dataid, 6, data[6]) worksheet.write(dataid, 7, data[7]) def startexcel(): global workbook, worksheet # 创建一个workbook 设置编码 workbook = xlwt.Workbook(encoding='utf-8') # 创建一个worksheet worksheet = workbook.add_sheet('data') # 参数对应 行, 列, 值 初始化 worksheet.write(0, 0, "id") worksheet.write(0, 1, "title") worksheet.write(0, 2, "auther") worksheet.write(0, 3, "summary") worksheet.write(0, 4, "content") worksheet.write(0, 5, "update") worksheet.write(0, 6, "source") worksheet.write(0, 7, "nid") if __name__ == "__main__": main() ``` # **2.少数派** **少数派这个网站还是项目里其他组员推荐的,于是就改了改简书的爬虫把它写下来了,基本大同小异,甚至比简书少了一步post时的数据校验。** ```python # -*- codeing = utf-8 -*- # @Time : 2021/8/3 16:17 # @Author : yunqi # @Fire : ssp.py # @Software : PyCharm import os import requests from bs4 import BeautifulSoup import re import time import xlwt import pymysql import json global header,config,cookie header = { 'Connection': 'keep-alive', 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36', 'X-PJAX': 'true' # 'sensorsdata2015jssdkcross=%7B%22distinct_id%22%3A%2217b3466849938b-08f7999168164d-4343363-2073600-17b3466849ace1%22%2C%22first_id%22%3A%22%22%2C%22props%22%3A%7B%7D%7D; sajssdk_2015_cross_new_user=1; Hm_lvt_0c0e9d9b1e7d617b3e6842e85b9fb068=1628420549,1628422073,1628423751,1628671739; locale=zh-CN; _ga=GA1.2.1168608088.1628671739; _gid=GA1.2.896180568.1628671739; UM_distinctid=17b346685e9346-0002a4ae6d88e7-4343363-1fa400-17b346685eae44; _m7e_session_core=f67ec2211a28225502a4d2b81632bcc0; signin_redirect=https%3A%2F%2Fwww.jianshu.com%2F; read_mode=day; default_font=font2; Hm_lpvt_0c0e9d9b1e7d617b3e6842e85b9fb068=1628682890; CNZZDATA1279807957=288995723-1628668245-%7C1628679045; sensorsdata2015jssdkcross=%7B%22distinct_id%22%3A%2217b3466849938b-08f7999168164d-4343363-2073600-17b3466849ace1%22%2C%22first_id%22%3A%22%22%2C%22props%22%3A%7B%22%24latest_traffic_source_type%22%3A%22%E7%9B%B4%E6%8E%A5%E6%B5%81%E9%87%8F%22%2C%22%24latest_search_keyword%22%3A%22%E6%9C%AA%E5%8F%96%E5%88%B0%E5%80%BC_%E7%9B%B4%E6%8E%A5%E6%89%93%E5%BC%80%22%2C%22%24latest_referrer%22%3A%22%22%7D%2C%22%24device_id%22%3A%2217b3510ac62318-05af7b13d1ca09-4343363-2073600-17b3510ac636e3%22%7D' } global workbook, worksheet global id, postdata postdata='' def main(): global cookie,header response = requests.get("https://sspai.com/", headers=header) cookie = requests.utils.dict_from_cookiejar(response.cookies) global workbook, worksheet, id #startexcel() #=====================正则==== textre=re.compile(r'<article class=".*?">') img1=re.compile(r'<img.*? data-original=".*?" src="') img2=re.compile(r'\?imageView.*?/>') passre=re.compile('<.*?>') #===================== for num in range(1,15): #首页最多获取15页内容 urls,jianjie,nid,title,author,creat_time=GetUrl(num) for v,i in enumerate(urls): global config db = pymysql.connect(**config) cursor = db.cursor() sql = "SELECT nid FROM article WHERE nid=" + str(nid[v]) # 查询文章绝对id是否存在 msg = cursor.execute(sql) cursor.close() db.close() if msg==1: print("该条数据已经存在------"+title[v]) continue try: raw = requests.get(i, headers=header).text #获取文章详情界面的内容 soup = BeautifulSoup(raw, "html.parser") # ----------对内容进行简单的处理 text = str(soup.select('.article-body')[0]) text = re.sub(textre, "", text) text = re.sub(img1, "\n[", text) text = re.sub(img2, "]", text) text = text.replace("</article>", "").replace("<p>", "\n").replace("</p>", "\n").replace("</br>","\n").replace("<b>", "\n").replace("</b>", '\n').replace("<br/>", '\n').replace("\"/>","") # 内容替换 text = re.sub(passre, " ", text) timeArray = time.localtime(creat_time[v]) Time = time.strftime("%Y-%m-%d %H:%M:%S", timeArray) data = (title[v], author[v], jianjie[v], text, Time, i, nid[v]) # SaveDataxls(data) print(data) SaveMyql(data) except IndexError: print('获取数据失败 跳过 可能是文章正在审核') print(i) time.sleep(10) continue time.sleep(1) os.system('python Article_Keywords.py') # 保存 #workbook.save('data.xls') #time.sleep(30) # 休息三十秒 def GetUrl(num): #从简书首页获取文章链接列表 #======================================正则 urls=[] jianjie=[] title=[] author=[] creat_time=[] nid=[] #===================================== global header,cookie timeArray = str(time.time()) raw=requests.get("https://sspai.com/api/v1/article/index/page/get?limit=10&offset="+str(num*10)+"&created_at="+timeArray,headers=header,cookies=cookie) jsondata=json.loads(raw.text)['data'] for i in jsondata: nid.append(i['id']) title.append(i['title']) jianjie.append(i['summary']) author.append(i['author']['nickname']) creat_time.append(i['released_time']) urls.append('https://sspai.com/post/'+str(i['id'])) return (urls,jianjie,nid,title,author,creat_time) #返回七个文章链接 每次加载七个 def SaveMyql(data): global config db = pymysql.connect(**config) cursor = db.cursor() sql = "INSERT INTO article(title,author,summary,content,create_time,link,nid) VALUES"+str(data) cursor.execute(sql) db.commit() # 提交数据 print("success") cursor.close() db.close() def SaveDataxls(data): global workbook, worksheet dataid=int(data[0]) # 写入excel # 参数对应 行, 列, 值 worksheet.write(dataid, 0, str(dataid)) worksheet.write(dataid, 1, data[1]) worksheet.write(dataid, 2, data[2]) worksheet.write(dataid, 3, data[3]) worksheet.write(dataid, 4, data[4]) worksheet.write(dataid, 5, data[5]) worksheet.write(dataid, 6, data[6]) worksheet.write(dataid, 7, data[7]) def startexcel(): global workbook, worksheet # 创建一个workbook 设置编码 workbook = xlwt.Workbook(encoding='utf-8') # 创建一个worksheet worksheet = workbook.add_sheet('data') # 参数对应 行, 列, 值 初始化 worksheet.write(0, 0, "id") worksheet.write(0, 1, "title") worksheet.write(0, 2, "auther") worksheet.write(0, 3, "summary") worksheet.write(0, 4, "content") worksheet.write(0, 5, "update") worksheet.write(0, 6, "source") worksheet.write(0, 7, "nid") if __name__ == "__main__": main() ``` # **3.酷推** **酷推是专门整合各类信息的网站,获取更多数据竟然还要充值季度会员,怒花10块大洋。** **爬了一些文章后,提示ip被ban,只能上代理了,这里用的是小象代理。** [https://www.xiaoxiangdaili.com/](https://www.xiaoxiangdaili.com/)短效ip 一块钱一天,还挺合算 ```python # coding = utf8mb4 import os import requests from bs4 import BeautifulSoup import time from retry import retry import multiprocessing import re global config import pymysql import text_summarizer import emoji config = { } def main(): wordlinks=[] #wordlinks.append('https://www.tuicool.com/ah/101050000/') #wordlinks.append('https://www.tuicool.com/ah/101040000/') # wordlinks.append('https://www.tuicool.com/ah/101000000/') # wordlinks.append('https://www.tuicool.com/ah/0/') # wordlinks.append('https://www.tuicool.com/topics/10050043?st=1&lang=1&pn=') # wordlinks.append('https://www.tuicool.com/topics/10050042?st=1&lang=1&pn=') # wordlinks.append('https://www.tuicool.com/topics/10050828?st=1&lang=1&pn=') # wordlinks.append('https://www.tuicool.com/topics/10050001?st=1&lang=1&pn=') wordlinks.append('https://www.tuicool.com/topics/11020012?st=1&lang=1&pn=') wordlinks.append('https://www.tuicool.com/topics/22120059?st=1&lang=1&pn=') wordlinks.append('https://www.tuicool.com/topics/22190243?st=1&lang=1&pn=') wordlinks.append('https://www.tuicool.com/topics/22120138?st=1&lang=1&pn=') requests.adapters.DEFAULT_RETRIES = 20 for w,wordlink in enumerate(wordlinks): for page in range(0, 2): proxies = changeip() header = { "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36", "Connection": "close", "cookie": "UM_distinctid=17c275c006ece3-083b66b719969d-a7d173c-384000-17c275c006fe78; __gads=ID=96df49f668e51a53-22c50fc1eecb00b5:T=1632748471:RT=1632748471:S=ALNI_MZ6tw2ZepbAUa-I3s2IIpLh2yOvCA; CNZZDATA5541078=cnzz_eid%3D620993454-1632739373-https%253A%252F%252Fwww.baidu.com%252F%26ntime%3D1632756015; _tuicool_session=OW4rL2MvODBaeFhBV1lHN1VqVWVVdjRYaUN6cXJZa3BGeGFCQTRybWRWZnVsS1IrZXlYb2E0NlpjMnNzUUo0NEhGTU94N1F2VS9FTjFVQzkySUxDMFFDL2FlRittN1VORGlxUmV6Wk9MTjZ5eG82ZVlpak9yVlozZm1RM3l0Y3pKU3hWM2RiTC9VVFRMUS8zbWZDb0lFZjhWay9RcGxBZVp4TFJlNURTQ3VwVHFDSzl6cm9NVGJMdlBrcTRIRWsvZEozY0cxL1B3SUNHN0Z5cEg5L240ZTRJaEpWNjl3aS9yU2Zoc1Q1U0Y4dk4vajM5YjNZbmtrSlg1MUN3V1hCMmc2U3cxVTFsdVhSTFVvb3doZFNhYmNJWDZ0MitMTW1ENExseG4vZHd4eVArWktpYkJCdWZtMzZHL1Jxa1kxdS8wOFNNRi93N2xHZFF1SVJUN0pOUWEwRkpTRHFKZUxudm9Xb3dEYXdSUStVU1l0K3hKQkRhWUZMU0d3YjQ1UExNRmphTkhiWTlyOFA4WG91RFVMekRyVTVGdzRGUXFlRzJrY1dKa21ncGtLTC9jREFqR1J3U2d5MGFKaWhQSGcwYlk3UmtTdEtlRDRDdi84QXpaN0RBYlF1elNpK09Iazc2UFFOUWQ1TkMxWjQ9LS04T3Ntd0VoMGxPbmhmWnBBekl6ZTV3PT0%3D--da792ede16542253ae36372e2225d41a90ea6760" } i = 0 while i < 5: try: raw = requests.get(wordlink + str(page), headers=header, proxies=proxies, timeout=2).text break except requests.exceptions.RequestException: i = i + 1 print('请求超时,正在重试...1') if i == 5: continue if str(raw)=="<Response [404]>": break soup = BeautifulSoup(raw, "html.parser") textlistdata = soup.select(".title a") textlist = [] ids = [] for i in textlistdata: textlist.append("https://www.tuicool.com" + i['href']) ids.append(i['href']) #p = multiprocessing.Process(target=spider, args=(textlist,wordname[w],ids,header,proxies)) #p.start() spider(textlist,ids,header,proxies) # time.sleep() time.sleep(5) print("第" + str(page) + "页采集成功...") os.system('python Article_Keywords.py') os.system('python classify.py') def changeip(): ip = requests.get( 'https://api.xiaoxiangdaili.com/ip/get?appKey=760113727143301120&appSecret=uRjSbv3w&cnt=&wt=text').text while ip=='{"code":1010,"success":false,"data":null,"msg":"请求过于频繁"}': time.sleep(5) ip = requests.get( 'https://api.xiaoxiangdaili.com/ip/get?appKey=760113727143301120&appSecret=uRjSbv3w&cnt=&wt=text').text print("更换ip成功" + ip) proxies = { 'http': ip, 'https': ip } return proxies def spider(textlist,ids,header,proxies): global config imgre1=re.compile('<img.*?src="') imgre2=re.compile('!web.*?>') passre = re.compile('<.*?>') db = pymysql.connect(**config,charset="utf8mb4") cursor = db.cursor() for v, link in enumerate(textlist): sql = "SELECT nid FROM article WHERE nid='" + (str(ids[v]).replace('/','')+"'") # 查询文章绝对id是否存在 msg = cursor.execute(sql) if msg == 1: print("该条数据已经存在------" ) continue i = 0 while i < 5: try: raw = requests.get(link, headers=header, proxies=proxies, timeout=2).text break except requests.exceptions.RequestException: i = i + 1 print('请求超时,正在重试...2') if i == 5: proxies = changeip() continue soup = BeautifulSoup(raw, "html.parser") try: article = str(soup.select(".article_body")[0]) article=article.replace('\xa0','').replace('xa0','').replace('<strong>','**').replace("</strong>","**").replace("<h1>","# ").replace("</h1>","").replace("<h2>","## ").replace("</h2>","").replace('ufeff','\n').replace('</p>','\n').replace('<br>','\n') article=re.sub(imgre1,' article = re.sub(imgre2, ')',article) article=re.sub(passre,"",article) article=filter_emoji(article) timea=soup.select('.timestamp')[0].text.replace('时间\xa0','').replace('\n','').replace(' ','') author=soup.select('.cut')[0].text.replace('\n','').replace(' ','') author=filter_emoji(author) source=soup.select('.cut')[1].text nnid=ids[v].replace("/", "") title=soup.select('h1')[0].text title=filter_emoji(title) except: continue data=(title,author,article,timea,source,nnid) print(data) sql = "INSERT INTO article(title,author,content,create_time,link,nid) VALUES" + str(data) try: cursor.execute(sql) except: continue db.commit() # 提交数据 print("success") cursor.close() db.close() #if os.path.exists(".\\" + wordname + "\\" + wordname + ids[v].replace("/", "-") + ".txt") == 0 and article != "": # f = open(".\\" + wordname + "\\" + wordname + ids[v].replace("/", "-") + ".txt", 'a', # encoding='utf-8') # f.write(article) # f.close # print(".\\" + wordname + "\\" + wordname + ids[v].replace("/", "-") + ".txt 采集成功..." + link) #else: # print("已经存在...跳过") # continue def filter_emoji(desstr, restr=''): # 过滤表情 try: co = re.compile(u'[\U00010000-\U0010ffff]') except re.error: co = re.compile(u'[\uD800-\uDBFF][\uDC00-\uDFFF]') return co.sub(restr, desstr) def SaveMyql(data): global config db = pymysql.connect(**config) cursor = db.cursor() sql = "INSERT INTO article(title,author,content,create_time,link,nid) VALUES"+str(data) cursor.execute(sql) db.commit() # 提交数据 print("success") cursor.close() db.close() # 按间距中的绿色按钮以运行脚本。 if __name__ == '__main__': main() ``` **上面很长的一个列表是获取词条单页时需要用到的,为此专门写了个小脚本,只不过没有将其整合。** ```python import requests import re import json def main(): id=[] link=[] header = { "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36", "Connection": "close", "cookie": "UM_distinctid=17c275c006ece3-083b66b719969d-a7d173c-384000-17c275c006fe78; __gads=ID=96df49f668e51a53-22c50fc1eecb00b5:T=1632748471:RT=1632748471:S=ALNI_MZ6tw2ZepbAUa-I3s2IIpLh2yOvCA; CNZZDATA5541078=cnzz_eid%3D620993454-1632739373-https%253A%252F%252Fwww.baidu.com%252F%26ntime%3D1632756015; _tuicool_session=OW4rL2MvODBaeFhBV1lHN1VqVWVVdjRYaUN6cXJZa3BGeGFCQTRybWRWZnVsS1IrZXlYb2E0NlpjMnNzUUo0NEhGTU94N1F2VS9FTjFVQzkySUxDMFFDL2FlRittN1VORGlxUmV6Wk9MTjZ5eG82ZVlpak9yVlozZm1RM3l0Y3pKU3hWM2RiTC9VVFRMUS8zbWZDb0lFZjhWay9RcGxBZVp4TFJlNURTQ3VwVHFDSzl6cm9NVGJMdlBrcTRIRWsvZEozY0cxL1B3SUNHN0Z5cEg5L240ZTRJaEpWNjl3aS9yU2Zoc1Q1U0Y4dk4vajM5YjNZbmtrSlg1MUN3V1hCMmc2U3cxVTFsdVhSTFVvb3doZFNhYmNJWDZ0MitMTW1ENExseG4vZHd4eVArWktpYkJCdWZtMzZHL1Jxa1kxdS8wOFNNRi93N2xHZFF1SVJUN0pOUWEwRkpTRHFKZUxudm9Xb3dEYXdSUStVU1l0K3hKQkRhWUZMU0d3YjQ1UExNRmphTkhiWTlyOFA4WG91RFVMekRyVTVGdzRGUXFlRzJrY1dKa21ncGtLTC9jREFqR1J3U2d5MGFKaWhQSGcwYlk3UmtTdEtlRDRDdi84QXpaN0RBYlF1elNpK09Iazc2UFFOUWQ1TkMxWjQ9LS04T3Ntd0VoMGxPbmhmWnBBekl6ZTV3PT0%3D--da792ede16542253ae36372e2225d41a90ea6760" } for page in range(2,10): raw = requests.get("https://www.tuicool.com/topics/my_hot?id="+str(page),headers=header).text jsondata=json.loads(raw)['data'] for i in jsondata: id.append(i['id']) for x in id: print("wordlinks.append('https://www.tuicool.com/topics/"+str(x)+"?st=1&lang=1&pn=')") if __name__ == '__main__': main() ``` # **4.心得与总结** **团队协作的感觉还不错(也可能是我负责的部分太简单),只需要专心做自己的事情就可以了,多多学习吧,掌握更多的技术。** 最后修改:2022 年 09 月 29 日 © 允许规范转载 打赏 赞赏作者 支付宝微信 赞 3 如果觉得我的文章对你有用,请随意赞赏
') img1=re.compile(r'\n\n\n\n\n.*?\n') timere=re.compile(r'data":{"is_author":.*?,"last_updated_at":(.*?),"public_title"') passre=re.compile('<.*?>') #===================== for num in range(1,15): #简书首页最多获取15页内容 links,seenid,jianjies=GetUrl(num) for v,i in enumerate(links): global config db = pymysql.connect(**config) cursor = db.cursor() sql = "SELECT nid FROM article WHERE nid=" + seenid[v] # 查询文章绝对id是否存在 msg = cursor.execute(sql) cursor.close() db.close() if msg==1: print("该条数据已经存在") continue try: i = "https://www.jianshu.com" + i id += 1 raw = requests.get(i, headers=header).text #获取文章详情界面的内容 soup = BeautifulSoup(raw, "html.parser") title=soup.select('h1') #获取文章标题 auther=re.findall(auth,str(soup))[0] #获取作者 time1=int(re.findall(timere,str(soup))[0]) timeArray = time.localtime(time1) Time = time.strftime("%Y-%m-%d %H:%M:%S", timeArray) #从时间戳转换到发布日期 nid=seenid[v] jianjie=jianjies[v] except IndexError: print('获取数据失败 跳过 可能是文章正在审核') print(i) id-=1 continue #----------对内容进行简单的处理 text=str(soup.select('article')[0]) text=re.sub(textre,"",text) text=re.sub(img1,"[https:",text) text = re.sub(img2, "]", text) text=text.replace("
') img1=re.compile(r'", "").replace("", "\n").replace("", "\n").replace("","\n").replace("", "\n").replace("", '\n').replace("", '\n').replace("\"/>","") # 内容替换 text = re.sub(passre, " ", text) timeArray = time.localtime(creat_time[v]) Time = time.strftime("%Y-%m-%d %H:%M:%S", timeArray) data = (title[v], author[v], jianjie[v], text, Time, i, nid[v]) # SaveDataxls(data) print(data) SaveMyql(data) except IndexError: print('获取数据失败 跳过 可能是文章正在审核') print(i) time.sleep(10) continue time.sleep(1) os.system('python Article_Keywords.py') # 保存 #workbook.save('data.xls') #time.sleep(30) # 休息三十秒 def GetUrl(num): #从简书首页获取文章链接列表 #======================================正则 urls=[] jianjie=[] title=[] author=[] creat_time=[] nid=[] #===================================== global header,cookie timeArray = str(time.time()) raw=requests.get("https://sspai.com/api/v1/article/index/page/get?limit=10&offset="+str(num*10)+"&created_at="+timeArray,headers=header,cookies=cookie) jsondata=json.loads(raw.text)['data'] for i in jsondata: nid.append(i['id']) title.append(i['title']) jianjie.append(i['summary']) author.append(i['author']['nickname']) creat_time.append(i['released_time']) urls.append('https://sspai.com/post/'+str(i['id'])) return (urls,jianjie,nid,title,author,creat_time) #返回七个文章链接 每次加载七个 def SaveMyql(data): global config db = pymysql.connect(**config) cursor = db.cursor() sql = "INSERT INTO article(title,author,summary,content,create_time,link,nid) VALUES"+str(data) cursor.execute(sql) db.commit() # 提交数据 print("success") cursor.close() db.close() def SaveDataxls(data): global workbook, worksheet dataid=int(data[0]) # 写入excel # 参数对应 行, 列, 值 worksheet.write(dataid, 0, str(dataid)) worksheet.write(dataid, 1, data[1]) worksheet.write(dataid, 2, data[2]) worksheet.write(dataid, 3, data[3]) worksheet.write(dataid, 4, data[4]) worksheet.write(dataid, 5, data[5]) worksheet.write(dataid, 6, data[6]) worksheet.write(dataid, 7, data[7]) def startexcel(): global workbook, worksheet # 创建一个workbook 设置编码 workbook = xlwt.Workbook(encoding='utf-8') # 创建一个worksheet worksheet = workbook.add_sheet('data') # 参数对应 行, 列, 值 初始化 worksheet.write(0, 0, "id") worksheet.write(0, 1, "title") worksheet.write(0, 2, "auther") worksheet.write(0, 3, "summary") worksheet.write(0, 4, "content") worksheet.write(0, 5, "update") worksheet.write(0, 6, "source") worksheet.write(0, 7, "nid") if __name__ == "__main__": main() ``` # **3.酷推** **酷推是专门整合各类信息的网站,获取更多数据竟然还要充值季度会员,怒花10块大洋。** **爬了一些文章后,提示ip被ban,只能上代理了,这里用的是小象代理。** [https://www.xiaoxiangdaili.com/](https://www.xiaoxiangdaili.com/)短效ip 一块钱一天,还挺合算 ```python # coding = utf8mb4 import os import requests from bs4 import BeautifulSoup import time from retry import retry import multiprocessing import re global config import pymysql import text_summarizer import emoji config = { } def main(): wordlinks=[] #wordlinks.append('https://www.tuicool.com/ah/101050000/') #wordlinks.append('https://www.tuicool.com/ah/101040000/') # wordlinks.append('https://www.tuicool.com/ah/101000000/') # wordlinks.append('https://www.tuicool.com/ah/0/') # wordlinks.append('https://www.tuicool.com/topics/10050043?st=1&lang=1&pn=') # wordlinks.append('https://www.tuicool.com/topics/10050042?st=1&lang=1&pn=') # wordlinks.append('https://www.tuicool.com/topics/10050828?st=1&lang=1&pn=') # wordlinks.append('https://www.tuicool.com/topics/10050001?st=1&lang=1&pn=') wordlinks.append('https://www.tuicool.com/topics/11020012?st=1&lang=1&pn=') wordlinks.append('https://www.tuicool.com/topics/22120059?st=1&lang=1&pn=') wordlinks.append('https://www.tuicool.com/topics/22190243?st=1&lang=1&pn=') wordlinks.append('https://www.tuicool.com/topics/22120138?st=1&lang=1&pn=') requests.adapters.DEFAULT_RETRIES = 20 for w,wordlink in enumerate(wordlinks): for page in range(0, 2): proxies = changeip() header = { "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36", "Connection": "close", "cookie": "UM_distinctid=17c275c006ece3-083b66b719969d-a7d173c-384000-17c275c006fe78; __gads=ID=96df49f668e51a53-22c50fc1eecb00b5:T=1632748471:RT=1632748471:S=ALNI_MZ6tw2ZepbAUa-I3s2IIpLh2yOvCA; CNZZDATA5541078=cnzz_eid%3D620993454-1632739373-https%253A%252F%252Fwww.baidu.com%252F%26ntime%3D1632756015; _tuicool_session=OW4rL2MvODBaeFhBV1lHN1VqVWVVdjRYaUN6cXJZa3BGeGFCQTRybWRWZnVsS1IrZXlYb2E0NlpjMnNzUUo0NEhGTU94N1F2VS9FTjFVQzkySUxDMFFDL2FlRittN1VORGlxUmV6Wk9MTjZ5eG82ZVlpak9yVlozZm1RM3l0Y3pKU3hWM2RiTC9VVFRMUS8zbWZDb0lFZjhWay9RcGxBZVp4TFJlNURTQ3VwVHFDSzl6cm9NVGJMdlBrcTRIRWsvZEozY0cxL1B3SUNHN0Z5cEg5L240ZTRJaEpWNjl3aS9yU2Zoc1Q1U0Y4dk4vajM5YjNZbmtrSlg1MUN3V1hCMmc2U3cxVTFsdVhSTFVvb3doZFNhYmNJWDZ0MitMTW1ENExseG4vZHd4eVArWktpYkJCdWZtMzZHL1Jxa1kxdS8wOFNNRi93N2xHZFF1SVJUN0pOUWEwRkpTRHFKZUxudm9Xb3dEYXdSUStVU1l0K3hKQkRhWUZMU0d3YjQ1UExNRmphTkhiWTlyOFA4WG91RFVMekRyVTVGdzRGUXFlRzJrY1dKa21ncGtLTC9jREFqR1J3U2d5MGFKaWhQSGcwYlk3UmtTdEtlRDRDdi84QXpaN0RBYlF1elNpK09Iazc2UFFOUWQ1TkMxWjQ9LS04T3Ntd0VoMGxPbmhmWnBBekl6ZTV3PT0%3D--da792ede16542253ae36372e2225d41a90ea6760" } i = 0 while i < 5: try: raw = requests.get(wordlink + str(page), headers=header, proxies=proxies, timeout=2).text break except requests.exceptions.RequestException: i = i + 1 print('请求超时,正在重试...1') if i == 5: continue if str(raw)=="": break soup = BeautifulSoup(raw, "html.parser") textlistdata = soup.select(".title a") textlist = [] ids = [] for i in textlistdata: textlist.append("https://www.tuicool.com" + i['href']) ids.append(i['href']) #p = multiprocessing.Process(target=spider, args=(textlist,wordname[w],ids,header,proxies)) #p.start() spider(textlist,ids,header,proxies) # time.sleep() time.sleep(5) print("第" + str(page) + "页采集成功...") os.system('python Article_Keywords.py') os.system('python classify.py') def changeip(): ip = requests.get( 'https://api.xiaoxiangdaili.com/ip/get?appKey=760113727143301120&appSecret=uRjSbv3w&cnt=&wt=text').text while ip=='{"code":1010,"success":false,"data":null,"msg":"请求过于频繁"}': time.sleep(5) ip = requests.get( 'https://api.xiaoxiangdaili.com/ip/get?appKey=760113727143301120&appSecret=uRjSbv3w&cnt=&wt=text').text print("更换ip成功" + ip) proxies = { 'http': ip, 'https': ip } return proxies def spider(textlist,ids,header,proxies): global config imgre1=re.compile('') passre = re.compile('<.*?>') db = pymysql.connect(**config,charset="utf8mb4") cursor = db.cursor() for v, link in enumerate(textlist): sql = "SELECT nid FROM article WHERE nid='" + (str(ids[v]).replace('/','')+"'") # 查询文章绝对id是否存在 msg = cursor.execute(sql) if msg == 1: print("该条数据已经存在------" ) continue i = 0 while i < 5: try: raw = requests.get(link, headers=header, proxies=proxies, timeout=2).text break except requests.exceptions.RequestException: i = i + 1 print('请求超时,正在重试...2') if i == 5: proxies = changeip() continue soup = BeautifulSoup(raw, "html.parser") try: article = str(soup.select(".article_body")[0]) article=article.replace('\xa0','').replace('xa0','').replace('','**').replace("","**").replace("","# ").replace("","").replace("","## ").replace("","").replace('ufeff','\n').replace('','\n').replace('','\n') article=re.sub(imgre1,' article = re.sub(imgre2, ')',article) article=re.sub(passre,"",article) article=filter_emoji(article) timea=soup.select('.timestamp')[0].text.replace('时间\xa0','').replace('\n','').replace(' ','') author=soup.select('.cut')[0].text.replace('\n','').replace(' ','') author=filter_emoji(author) source=soup.select('.cut')[1].text nnid=ids[v].replace("/", "") title=soup.select('h1')[0].text title=filter_emoji(title) except: continue data=(title,author,article,timea,source,nnid) print(data) sql = "INSERT INTO article(title,author,content,create_time,link,nid) VALUES" + str(data) try: cursor.execute(sql) except: continue db.commit() # 提交数据 print("success") cursor.close() db.close() #if os.path.exists(".\\" + wordname + "\\" + wordname + ids[v].replace("/", "-") + ".txt") == 0 and article != "": # f = open(".\\" + wordname + "\\" + wordname + ids[v].replace("/", "-") + ".txt", 'a', # encoding='utf-8') # f.write(article) # f.close # print(".\\" + wordname + "\\" + wordname + ids[v].replace("/", "-") + ".txt 采集成功..." + link) #else: # print("已经存在...跳过") # continue def filter_emoji(desstr, restr=''): # 过滤表情 try: co = re.compile(u'[\U00010000-\U0010ffff]') except re.error: co = re.compile(u'[\uD800-\uDBFF][\uDC00-\uDFFF]') return co.sub(restr, desstr) def SaveMyql(data): global config db = pymysql.connect(**config) cursor = db.cursor() sql = "INSERT INTO article(title,author,content,create_time,link,nid) VALUES"+str(data) cursor.execute(sql) db.commit() # 提交数据 print("success") cursor.close() db.close() # 按间距中的绿色按钮以运行脚本。 if __name__ == '__main__': main() ``` **上面很长的一个列表是获取词条单页时需要用到的,为此专门写了个小脚本,只不过没有将其整合。** ```python import requests import re import json def main(): id=[] link=[] header = { "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36", "Connection": "close", "cookie": "UM_distinctid=17c275c006ece3-083b66b719969d-a7d173c-384000-17c275c006fe78; __gads=ID=96df49f668e51a53-22c50fc1eecb00b5:T=1632748471:RT=1632748471:S=ALNI_MZ6tw2ZepbAUa-I3s2IIpLh2yOvCA; CNZZDATA5541078=cnzz_eid%3D620993454-1632739373-https%253A%252F%252Fwww.baidu.com%252F%26ntime%3D1632756015; _tuicool_session=OW4rL2MvODBaeFhBV1lHN1VqVWVVdjRYaUN6cXJZa3BGeGFCQTRybWRWZnVsS1IrZXlYb2E0NlpjMnNzUUo0NEhGTU94N1F2VS9FTjFVQzkySUxDMFFDL2FlRittN1VORGlxUmV6Wk9MTjZ5eG82ZVlpak9yVlozZm1RM3l0Y3pKU3hWM2RiTC9VVFRMUS8zbWZDb0lFZjhWay9RcGxBZVp4TFJlNURTQ3VwVHFDSzl6cm9NVGJMdlBrcTRIRWsvZEozY0cxL1B3SUNHN0Z5cEg5L240ZTRJaEpWNjl3aS9yU2Zoc1Q1U0Y4dk4vajM5YjNZbmtrSlg1MUN3V1hCMmc2U3cxVTFsdVhSTFVvb3doZFNhYmNJWDZ0MitMTW1ENExseG4vZHd4eVArWktpYkJCdWZtMzZHL1Jxa1kxdS8wOFNNRi93N2xHZFF1SVJUN0pOUWEwRkpTRHFKZUxudm9Xb3dEYXdSUStVU1l0K3hKQkRhWUZMU0d3YjQ1UExNRmphTkhiWTlyOFA4WG91RFVMekRyVTVGdzRGUXFlRzJrY1dKa21ncGtLTC9jREFqR1J3U2d5MGFKaWhQSGcwYlk3UmtTdEtlRDRDdi84QXpaN0RBYlF1elNpK09Iazc2UFFOUWQ1TkMxWjQ9LS04T3Ntd0VoMGxPbmhmWnBBekl6ZTV3PT0%3D--da792ede16542253ae36372e2225d41a90ea6760" } for page in range(2,10): raw = requests.get("https://www.tuicool.com/topics/my_hot?id="+str(page),headers=header).text jsondata=json.loads(raw)['data'] for i in jsondata: id.append(i['id']) for x in id: print("wordlinks.append('https://www.tuicool.com/topics/"+str(x)+"?st=1&lang=1&pn=')") if __name__ == '__main__': main() ``` # **4.心得与总结** **团队协作的感觉还不错(也可能是我负责的部分太简单),只需要专心做自己的事情就可以了,多多学习吧,掌握更多的技术。**
5 条评论
222
3344
1122
嘿嘿
111