Notes on a Python Scraper: Luogu as an Example

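I can only update my blog from my home computer, which is inconvenient. It would be nice to update it dynamically from a browser, but that needs a cloud server, and, well, I have no money. So I came up with an outrageous plan:

Write and update articles on some blogging site.
Have a program on my home computer periodically scrape the articles from that site.
Update the blog.

Wa wa wa, that is just perfect.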

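So the goal becomes: scrape private articles.

Which brings a prerequisite: logging in to the account.

One simple way is to grab the cookie by hand and put it into the request headers. But that is inelegant, and the cookie has to be re-fetched by hand every so often.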
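As a minimal sketch of the manual approach, the cookie values copied from the browser can be joined into a `Cookie` header. The cookie names `__client_id` and `_uid` are the ones that show up in the program output later in this post; the values and the helper name `cookie_header` are placeholders. Only the standard library is used and nothing is sent.

```python
from urllib.request import Request


def cookie_header(cookies: dict) -> str:
    """Join name/value pairs into the value of a Cookie request header."""
    return "; ".join(f"{name}={value}" for name, value in cookies.items())


# Placeholder values; the real ones would be copied from the browser dev tools.
manual_cookies = {"__client_id": "PASTE_VALUE_HERE", "_uid": "670766"}

# Build (but do not send) a request that carries the cookie.
request = Request(
    "https://www.luogu.com.cn/",
    headers={"Cookie": cookie_header(manual_cookies)},
)
print(request.get_header("Cookie"))
```

The drawback stays the same as described above: once the cookie expires, the values have to be copied again by hand.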

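The second option is to simulate the login and obtain the cookie that way. I went with this one.

First, F12 (the browser dev tools) reveals the flow of Luogu's login requests:

Request https://www.luogu.com.cn/auth/login (GET); it does not seem to do much.
After the username is typed in, request https://www.luogu.com.cn/auth/login-methods?login=username (GET). I did not see what it returns.
Submit the login: https://www.luogu.com.cn/do-auth/password (POST), which needs the username, the password, and a captcha.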
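The three steps above can be sketched as request objects built with the standard library. Nothing is sent here; the helper name `build_login_flow`, the dummy credentials, and the form-encoded body are assumptions for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE = "https://www.luogu.com.cn"


def build_login_flow(username: str, password: str, captcha: str) -> list:
    """Return the three login requests in the order the browser issues them."""
    # Step 1: load the login page.
    page = Request(f"{BASE}/auth/login", method="GET")
    # Step 2: query the login methods for this username.
    methods = Request(
        f"{BASE}/auth/login-methods?" + urlencode({"login": username}),
        method="GET",
    )
    # Step 3: submit username, password and captcha.
    body = urlencode(
        {"username": username, "password": password, "captcha": captcha}
    ).encode()
    submit = Request(f"{BASE}/do-auth/password", data=body, method="POST")
    return [page, methods, submit]


for req in build_login_flow("someone", "secret", "abcd"):
    print(req.get_method(), req.full_url)
```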

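Another fun thing: Luogu's anti-scraping measures are fairly strong. The first is a cookie check: when it fails, the response carries a set_cookie, and you simply set your cookie to that value and carry on, which is pretty funny.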
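A rough sketch of that cookie step: when a response carries a `Set-Cookie` header, parse it and reuse the value on the next request. The header string below is made up for illustration and `cookies_from_set_cookie` is a hypothetical helper; a `requests.Session` does this bookkeeping automatically.

```python
from http.cookies import SimpleCookie


def cookies_from_set_cookie(header_value: str) -> dict:
    """Parse one Set-Cookie header value into a {name: value} dict."""
    jar = SimpleCookie()
    jar.load(header_value)
    return {name: morsel.value for name, morsel in jar.items()}


# Made-up example header, not a real Luogu response.
example = "__client_id=0123abcd; Path=/; HttpOnly"
print(cookies_from_set_cookie(example))
```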

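The other is that the login request carries two extra request headers:

X-Requested-With : XMLHttpRequest, which reportedly marks the request as an AJAX (asynchronous) API call.
X-CSRF-TOKEN : …, this one is the troublesome one, so I looked it up.

Reference

I searched for ages and could not find where this thing is generated; luckily someone before me had pointed the way. First, it does not seem to be in the URL or the request headers, at least I did not see it there.

Oh! It is right there in the page's HTML: <meta name="csrf-token" content="1774766584:XvhhVwI22HPNHDEKIAzL5n9j3EERx5qCFt5GGZrhafw=">

?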

$ python .\luogu.py
Hi agent, as you see, I am just a cute teapot. Wanna a cup of tea?

No big deal, just change the headers.

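import re
import requests

# HOMEPAGE and HEADERS are defined in the full script at the end of this post.
response = requests.get(HOMEPAGE, headers=HEADERS)
# The token sits in a <meta name="csrf-token"> tag in the page HTML.
csrf_token = re.search(r'<meta name="csrf-token" content="([^"]+)"', response.text).group(1)
print(csrf_token)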
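The `re.search(...).group(1)` call above raises `AttributeError` when the tag is missing, for example when the teapot page comes back instead of the login page. A slightly more defensive sketch, with `extract_csrf_token` as a hypothetical helper name:

```python
import re
from typing import Optional

CSRF_META = re.compile(r'<meta name="csrf-token" content="([^"]+)"')


def extract_csrf_token(html: str) -> Optional[str]:
    """Return the csrf-token from the page HTML, or None if it is absent."""
    match = CSRF_META.search(html)
    return match.group(1) if match else None


sample = (
    '<head><meta name="csrf-token" '
    'content="1774766584:XvhhVwI22HPNHDEKIAzL5n9j3EERx5qCFt5GGZrhafw="></head>'
)
print(extract_csrf_token(sample))
print(extract_csrf_token("I am just a cute teapot."))
```

Returning `None` lets the caller decide whether to retry with different headers or give up, instead of crashing in the middle of the login.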

Mm.

Then just simulate the flow step by step:
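A minimal sketch of the pieces the login POST needs, based on the headers and fields observed above. The helper names are hypothetical, and the exact set of headers Luogu requires is an assumption drawn from what the browser sent.

```python
def build_login_headers(csrf_token: str, user_agent: str) -> dict:
    """Headers for the login POST: a browser-like UA plus the two extra headers."""
    return {
        "User-Agent": user_agent,
        "Referer": "https://www.luogu.com.cn/auth/login",
        "Origin": "https://www.luogu.com.cn",
        "X-Requested-With": "XMLHttpRequest",  # marks an AJAX call
        "X-CSRF-TOKEN": csrf_token,            # taken from the login page HTML
    }


def build_login_payload(username: str, password: str, captcha: str) -> dict:
    """Form fields for the login POST."""
    return {"username": username, "password": password, "captcha": captcha}


headers = build_login_headers("TOKEN_FROM_PAGE", "Mozilla/5.0")
payload = build_login_payload("someone", "secret", "abcd")
print(headers["X-CSRF-TOKEN"], sorted(payload))
```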

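The hardest part should be over now. As an extension, one could add captcha recognition. I will shelve that for now and get article scraping working first.

Here I found a document describing the Luogu API:

Following its description, I wrote the code, which also fills in the file metadata automatically.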

import re
from time import localtime, strftime

import requests

# HEADERS is the shared header dict from the full script at the end of this post.

def get_blogs_list(session: requests.Session, user, page):  # user: e.g. 670766
    API = "https://www.luogu.com.cn/api/blog/userBlogs"
    blog_list = session.get(API, headers=HEADERS, params={"user": str(user), "page": page})
    return blog_list.json()

def pull_blog_md(session: requests.Session, blog_id, path):
    API = "https://www.luogu.com.cn/api/blog/detail/"
    ret = session.get(API + blog_id, headers=HEADERS).json()["data"]
    title = ret["Title"][4:]  # drop the leading "blog" marker
    post_time = strftime("%Y-%m-%d", localtime(ret["PostTime"]))
    content = ret["Content"]
    # The article must open with a ``` fenced block holding the hand-written metadata.
    match = re.search(r"\A```[\s\S]*?```", content)
    print(f"Pulling {title}")
    if match is None:
        print(f"{title}: bad format!")
        return False
    information = match.group()[3:-3]  # strip the two ``` fences
    other = content[match.end():]
    with open(f"{path}\\{title}.md", "w", encoding="utf-8") as f:
        f.write("---")
        f.write(information)
        f.write(f"published: {post_time}\npubDate: {post_time}\ndate: {post_time}\ntitle: {title}\n---")
        f.write(other)
    print(f"Pulled {title} successfully!")
    return True

def pull_context(session: requests.Session, user_id, path):
    now_page = 1
    while True:
        blog_list = get_blogs_list(session, user_id, now_page)["blogs"]["result"]
        if len(blog_list) == 0:
            break  # an empty page means there is nothing left
        for i in blog_list:
            # Only titles starting with "blog" are treated as blog posts.
            if i["title"][:4] == "blog":
                pull_blog_md(session, str(i["id"]), path)
        now_page += 1
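The paging loop in `pull_context` stops at the first empty page. Here is that logic in isolation, with a fake page source standing in for the API so it runs offline; `iter_blog_ids`, `fake_fetch`, and `FAKE_PAGES` are made up for illustration.

```python
from typing import Callable, Iterator, List

# Fake data standing in for API responses: page number -> list of blog entries.
FAKE_PAGES = {
    1: [{"id": 1, "title": "blog first post"}, {"id": 2, "title": "solution notes"}],
    2: [{"id": 3, "title": "blog second post"}],
}


def fake_fetch(page: int) -> List[dict]:
    """Return the entries on a page, or an empty list past the last page."""
    return FAKE_PAGES.get(page, [])


def iter_blog_ids(fetch: Callable[[int], List[dict]], prefix: str = "blog") -> Iterator[int]:
    """Walk pages from 1 until an empty one, yielding ids whose title has the prefix."""
    page = 1
    while True:
        entries = fetch(page)
        if not entries:
            break
        for entry in entries:
            if entry["title"].startswith(prefix):
                yield entry["id"]
        page += 1


print(list(iter_blog_ids(fake_fetch)))
```

Passing the fetch function in as a parameter keeps the loop testable without a logged-in session.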

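The next step is to save the pulled articles into the blog folder, then call the update program I wrote earlier to upload them to GitHub automatically.

Below is the test article I created on Luogu:

The scraper can fetch things like the creation time, so I let the program fill that part in for me. The rest I write at the top of the article as usual.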
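To make the convention concrete: the article starts with a fenced block holding the hand-written metadata, and the program wraps it in `---` lines together with the generated date fields. A small offline sketch of that transformation follows; the sample article, its `tags` field, the date, and the function name `to_front_matter` are invented for illustration.

```python
import re
from typing import Optional

FENCE = "`" * 3  # three backticks, built this way to keep the example readable
LEADING_BLOCK = re.compile(r"\A" + FENCE + r"[\s\S]*?" + FENCE)


def to_front_matter(content: str, title: str, date: str) -> Optional[str]:
    """Turn a leading fenced metadata block into front matter, or return None."""
    match = LEADING_BLOCK.search(content)
    if match is None:
        return None
    info = match.group()[3:-3]  # text between the fences, newlines included
    body = content[match.end():]
    generated = f"published: {date}\npubDate: {date}\ndate: {date}\ntitle: {title}\n"
    return "---" + info + generated + "---" + body


sample = f"{FENCE}\ntags: [test]\n{FENCE}\n\nHello from Luogu.\n"
print(to_front_matter(sample, "测试测试", "2026-01-01"))
```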

Below is the result after scraping:

Mm, pretty good.

Then run the update:

$ python .\pull_from_luogu.py
Cookie loaded
Cookie valid: True
Current cookie:  {'__client_id': 'not showing you', '_uid': '670766'}
Pulling "测试测试"
Pulled "测试测试" successfully!
$ python .\update_post.py
Updating "测试测试"
git ok.

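You have probably already seen the result..

One remaining issue is versioning. What do I mean by that? My articles now live in three places: GitHub, my local machine, and Luogu. Which one should be authoritative for each edit? Eh, I can't be bothered; I'm the only one using this anyway..

That's it, then.

Finally, here is the code: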

import base64
import json
import re
from os import makedirs
from os.path import exists, join
from time import localtime, strftime

import requests

USERID = 670766
SAVE_PATH = r"D:\notes\blog"

HOMEPAGE = "https://www.luogu.com.cn/auth/login"
CAPTCHA = "https://www.luogu.com.cn/lg4/captcha"
LOGIN = "https://www.luogu.com.cn/do-auth/password"  # POST
LOGIN_METHODS = "https://www.luogu.com.cn/auth/login-methods"

COOKIE_FILE = "luogu_cookie.json"
PASSWORD_FILE = "luogu_password.txt"

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:148.0) Gecko/20100101 Firefox/148.0",
    "Referer": "https://www.luogu.com.cn/",
    "Origin": "https://www.luogu.com.cn",
    "X-Requested-With": "XMLHttpRequest",  # marks an AJAX (asynchronous) API request
}

def save_cookie(session: requests.Session):
    with open(COOKIE_FILE, "w", encoding="utf-8") as f:
        json.dump(requests.utils.dict_from_cookiejar(session.cookies), f, indent=2)
    print("Cookie saved")

def load_cookie(session: requests.Session):
    with open(COOKIE_FILE, "r", encoding="utf-8") as f:
        session.cookies.update(json.load(f))
    print("Cookie loaded")

def check_cookie_valid(session: requests.Session):
    # The settings page only answers 200 when logged in; redirects are not followed.
    check_url = "https://www.luogu.com.cn/user/setting"
    resp = session.get(check_url, headers=HEADERS, allow_redirects=False)
    valid = resp.status_code == 200
    print(f"Cookie valid: {valid}")
    return valid

def login(session: requests.Session):
    LG_HEADERS = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:148.0) Gecko/20100101 Firefox/148.0",
        "Referer": "https://www.luogu.com.cn/",
        "X-Requested-With": "XMLHttpRequest",
        "Host": "www.luogu.com.cn",
    }

    # Step 1: load the login page and read the CSRF token from its HTML.
    response = session.get(HOMEPAGE, headers=LG_HEADERS, allow_redirects=True)
    csrf_token = re.search(r'<meta name="csrf-token" content="([^"]+)"', response.text).group(1)

    LG_HEADERS["X-CSRF-TOKEN"] = csrf_token

    # The password file holds two base64 lines: username, then password.
    with open(PASSWORD_FILE, "r") as f:
        USERNAME = base64.b64decode(f.readline()).decode()
        PASSWORD = base64.b64decode(f.readline()).decode()

    # Step 2: query the login methods, as the browser does after the username is typed.
    LG_HEADERS["Referer"] = "https://www.luogu.com.cn/auth/login"
    login1 = session.get(LOGIN_METHODS, params={"login": USERNAME}, headers=LG_HEADERS)
    print("Status code:", login1.status_code)

    # Step 3: fetch a captcha, ask the user to read it, and submit; retry until success.
    while True:
        captcha_image = session.get(CAPTCHA, headers=LG_HEADERS, allow_redirects=False).content
        with open("luogu_captcha.jpeg", "wb") as f:
            f.write(captcha_image)

        captcha = input("Check the image and enter the captcha: ")[:4]
        print(captcha, len(captcha))

        data = {
            "captcha": captcha,
            "password": PASSWORD,
            "username": USERNAME,
        }

        # HEADERS["Content-Type"] = "application/json"
        LG_HEADERS["Origin"] = "https://www.luogu.com.cn"
        login2 = session.post(LOGIN, headers=LG_HEADERS, data=data, allow_redirects=False)
        print("Status code:", login2.status_code)

        if login2.status_code == 200:
            print("\033[32mLogin succeeded\033[0m")
            save_cookie(session)
            return True
        elif login2.status_code == 400:
            print("\033[31mLogin failed\033[0m, wrong captcha:")
        elif login2.status_code == 401:
            print("\033[31mLogin failed\033[0m, wrong password:")
        else:
            print("\033[31mLogin failed\033[0m, other error:")
            print(login2.text)

def get_blogs_list(session: requests.Session, user, page):  # user: e.g. 670766
    API = "https://www.luogu.com.cn/api/blog/userBlogs"
    blog_list = session.get(API, headers=HEADERS, params={"user": str(user), "page": page})
    return blog_list.json()

def pull_blog_md(session: requests.Session, blog_id, path):
    API = "https://www.luogu.com.cn/api/blog/detail/"
    ret = session.get(API + blog_id, headers=HEADERS).json()["data"]
    title = ret["Title"][5:]  # drop the first 5 characters ("blog" plus one more)
    post_time = strftime("%Y-%m-%d", localtime(ret["PostTime"]))
    content = ret["Content"]
    # The article must open with a ``` fenced block holding the hand-written metadata.
    match = re.search(r"\A```[\s\S]*?```", content)
    if match is None:
        print(f"\033[31mBad format!\033[0m Skipping article \"{title}\"")
        return False
    information = match.group()[3:-3]  # strip the two ``` fences
    other = content[match.end():]

    # Each article gets its own folder containing index.md.
    article_dir = join(path, title)
    makedirs(article_dir, exist_ok=True)
    with open(join(article_dir, "index.md"), "w", encoding="utf-8") as f:
        f.write("---")
        f.write(information)
        f.write(f"published: {post_time}\npubDate: {post_time}\ndate: {post_time}\ntitle: {title}\n---")
        f.write(other)
    print(f"\033[32mPulled!\033[0m \"{title}\" has been updated!")
    return True

def pull_context(session: requests.Session, user_id, path):
    now_page = 1
    while True:
        blog_list = get_blogs_list(session, user_id, now_page)["blogs"]["result"]
        if len(blog_list) == 0:
            break  # an empty page means there is nothing left
        for i in blog_list:
            # Only titles starting with "blog" are treated as blog posts.
            if i["title"][:4] == "blog":
                pull_blog_md(session, str(i["id"]), path)
        now_page += 1

if __name__ == "__main__":
    session = requests.Session()
    # On the first run there is no cookie file yet, so only load it if present.
    if exists(COOKIE_FILE):
        load_cookie(session)
    if not check_cookie_valid(session):
        session.cookies.clear()
        login(session)
    print("Current cookie: ", requests.utils.dict_from_cookiejar(session.cookies))
    print("Start pulling:\n")
    pull_context(session, USERID, SAVE_PATH)
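The script reads `luogu_password.txt` as two base64 lines, username first and password second. A hypothetical helper for producing that file is sketched below, with dummy credentials and a temporary directory for the demo. Note that base64 is only an encoding, not encryption, so the file still needs to be kept private.

```python
import base64
import os
import tempfile


def write_credentials(path: str, username: str, password: str) -> None:
    """Write username and password as two base64-encoded lines."""
    with open(path, "w", encoding="ascii") as f:
        for value in (username, password):
            f.write(base64.b64encode(value.encode()).decode() + "\n")


def read_credentials(path: str) -> tuple:
    """Read the two lines back, mirroring what login() does."""
    with open(path, "r", encoding="ascii") as f:
        username = base64.b64decode(f.readline()).decode()
        password = base64.b64decode(f.readline()).decode()
    return username, password


# Demo in a temporary directory with dummy credentials.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "luogu_password.txt")
    write_credentials(demo_path, "someone", "secret")
    print(read_credentials(demo_path))
```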

QWQ_SenLin