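This case study uses the land market list page request and the land supply result search page request as examples. The site is not actually hard to scrape, but a few details are worth learning.
Link: https://www.landchina.com/default.aspx?tabid=263
Packet capture analysis
POST endpoint: https://www.landchina.com/default.aspx?tabid=263
The request headers contain no dynamic parameters.
The form data does not appear to contain dynamic parameters either. However, one parameter is displayed as "(unable to decode value)".
In other words, once we recover the real value of that undisplayable parameter, TAB_QuerySubmitConditionData, we can construct the request.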
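Locating the parameter
The parameter to locate: TAB_QuerySubmitConditionData
The Initiator column shows no call stack, which suggests the request is submitted by the page itself rather than built by a script.
A global search for TAB_QuerySubmitConditionData finds it only in the HTML file, so the value is read and submitted within the page.
Inspect the element.
Here the value is 42ad98ae-c46a-40aa-aacc-c0884036eeaf:110101▓~东城区
- 42ad98ae-c46a-40aa-aacc-c0884036eeaf is a fixed value
- 110101 is a region ID
- ▓ is a special character
- 东城区 is our search term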
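The four parts listed above can be put together (and taken apart again) with a couple of small helpers. This is only a sketch: the names FIXED_ID, SEPARATOR, build_condition and parse_condition are made up for illustration, and it assumes the layout is always fixed ID, colon, region ID, ▓~, region name, as in the single value observed here.

```python
# Fixed identifier observed in the page's form field
FIXED_ID = "42ad98ae-c46a-40aa-aacc-c0884036eeaf"
# The special character ▓ (U+2593) followed by a tilde
SEPARATOR = "\u2593~"


def build_condition(region_id, region_name):
    """Assemble a TAB_QuerySubmitConditionData value from its parts."""
    return f"{FIXED_ID}:{region_id}{SEPARATOR}{region_name}"


def parse_condition(value):
    """Split a value back into (fixed_id, region_id, region_name)."""
    fixed_id, rest = value.split(":", 1)
    region_id, region_name = rest.split(SEPARATOR, 1)
    return fixed_id, region_id, region_name


condition = build_condition("110101", "东城区")
print(condition)
print(parse_condition(condition))
```

Writing the character as the escape "\u2593" keeps the source file independent of whatever encoding the editor happens to save it in.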
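Worried that copying and pasting might mangle the character, I looked up on Baidu how to type it directly.
Hold down the Alt key, type 43144 on the numeric keypad at the right of the keyboard, then release Alt, and ▓ appears.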
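If typing the Alt code is inconvenient, the same character can be produced in code. The sketch below rests on my understanding that 43144 is simply the decimal form of the two-byte GBK code 0xA888, and that this code maps to U+2593; the asserts check that reading against Python's own gbk codec.

```python
ALT_CODE = 43144

# Interpret the Alt code as a big-endian two-byte GBK sequence
gbk_bytes = ALT_CODE.to_bytes(2, "big")
shade = gbk_bytes.decode("gbk")

# The same character written as a Unicode escape
assert shade == "\u2593"
assert "\u2593".encode("gbk") == gbk_bytes

print(hex(ALT_CODE), gbk_bytes, shade)
```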
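Simulating the request
With the parameters settled, we can simulate the POST request.
The form data shows "(unable to decode value)" because of the encoding used by the site's page,
so this parameter has to be converted before sending: encode("gbk").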
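To see why the encoding step matters, it helps to compare the form body produced from GBK bytes with the one produced from a plain string. This is a sketch using only the standard library's urlencode, which requests also relies on for form data as far as I know; the variable names are illustrative.

```python
from urllib.parse import urlencode

value = "42ad98ae-c46a-40aa-aacc-c0884036eeaf:11\u2593~北京市"

# Bytes are percent-encoded as-is, so the server receives GBK
gbk_body = urlencode({"TAB_QuerySubmitConditionData": value.encode("gbk")})

# A str is encoded as UTF-8 by default, which the GBK page would misread
utf8_body = urlencode({"TAB_QuerySubmitConditionData": value})

print(gbk_body)
print(utf8_body)
```

In the first body ▓ is sent as %A8%88, while in the second it becomes %E2%96%93, which is presumably why the un-encoded request does not match anything on the server side.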
Request example:
import requests
data = {
"__VIEWSTATE": "/wEPDwUJNjkzNzgyNTU4D2QWAmYPZBYIZg9kFgICAQ9kFgJmDxYCHgNzcmMFN1VzZXIvZGVmYXVsdC9VcGxvYWQvc3lzRnJhbWVJbWcveF90ZHNjdzIwMjBfZmxhc2hfMS5wbmdkAgEPZBYCAgEPFgIeBXN0eWxlBSBCQUNLR1JPVU5ELUNPTE9SOiNmM2Y1Zjc7Q09MT1I6O2QCAg9kFgICAQ9kFgJmD2QWAmYPZBYCZg9kFgRmD2QWAmYPZBYCZg9kFgJmD2QWAmYPZBYCZg8WBB8BBSBDT0xPUjojRDNEM0QzO0JBQ0tHUk9VTkQtQ09MT1I6Ox4HVmlzaWJsZWgWAmYPZBYCAgEPZBYCZg8PFgIeBFRleHRlZGQCAQ9kFgJmD2QWAmYPZBYCZg9kFgRmD2QWAmYPFgQfAQWHAUNPTE9SOiNEM0QzRDM7QkFDS0dST1VORC1DT0xPUjo7QkFDS0dST1VORC1JTUFHRTp1cmwoaHR0cDovL3d3dy5sYW5kY2hpbmEuY29tL1VzZXIvZGVmYXVsdC9VcGxvYWQvc3lzRnJhbWVJbWcveF90ZHNjd19zeV9qaGdnXzAwMC5naWYpOx4GaGVpZ2h0BQEzFgJmD2QWAgIBD2QWAmYPDxYCHwNlZGQCAg9kFgJmD2QWAmYPZBYCZg9kFgJmD2QWAmYPZBYCZg9kFgRmD2QWAmYPFgQfAQUgQ09MT1I6I0QzRDNEMztCQUNLR1JPVU5ELUNPTE9SOjsfAmgWAmYPZBYCAgEPZBYCZg8PFgIfA2VkZAICD2QWAmYPZBYEZg9kFgJmD2QWAmYPZBYCZg9kFgJmD2QWAmYPZBYCZg8WBB8BBSBDT0xPUjojRDNEM0QzO0JBQ0tHUk9VTkQtQ09MT1I6Ox8CaBYCZg9kFgICAQ9kFgJmDw8WAh8DZWRkAgIPZBYEZg9kFgJmD2QWAmYPZBYCZg9kFgICAQ9kFgJmDxYEHwEFhgFDT0xPUjojRDNEM0QzO0JBQ0tHUk9VTkQtQ09MT1I6O0JBQ0tHUk9VTkQtSU1BR0U6dXJsKGh0dHA6Ly93d3cubGFuZGNoaW5hLmNvbS9Vc2VyL2RlZmF1bHQvVXBsb2FkL3N5c0ZyYW1lSW1nL3hfdGRzY3dfenlfamdnZ18wMS5naWYpOx8EBQI0NhYCZg9kFgICAQ9kFgJmDw8WAh8DZWRkAgEPZBYCZg9kFgJmD2QWAmYPZBYCAgEPZBYCZg8WBB8BBSBDT0xPUjojRDNEM0QzO0JBQ0tHUk9VTkQtQ09MT1I6Ox8CaBYCZg9kFgICAQ9kFgJmDw8WAh8DZWRkAgMPZBYCAgMPFgQeCWlubmVyaHRtbAWuEDxwPjxzdHlsZSB0eXBlPSJ0ZXh0L2NzcyI+QTpsaW5rIHsgQ09MT1I6IzAwMDAwMDsgVEVYVC1ERUNPUkFUSU9OOk5vbmV9QTp2aXNpdGVkIHsgIENPTE9SOiMwMDAwMDA7IFRFWFQtREVDT1JBVElPTjpOb25lfUE6YWN0aXZlIHsgICAgICBDT0xPUjojMDAwMDAwOyBURVhULURFQ09SQVRJT046Tm9uZX1BOmhvdmVyIHsgICAgQ09MT1I6IzAwOTlGRjsgVEVYVC1ERUNPUkFUSU9OOk5vbmV9PC9zdHlsZT48L3A+PHA+PGJyIC8+Jm5ic3A7PC9wPjx0YWJsZT48dGJvZHk+PHRyIGNsYXNzPSJmaXJzdFJvdyI+PHRkIHZhbGlnbj0idG9wIiB3aWR0aD0iMzcwIiBzdHlsZT0iYm9yZGVyLWJvdHRvbTogMHB4IHNvbGlkOyBib3JkZXItbGVmdDogMHB4IHNvbGlkOyBib3JkZXItdG9wOiAwcHggc29saWQ7IGJvcmRlci1yaWdodDogMHB4IHNvbGlkIj48cCBzdHlsZT0idGV4dC1hbGlnbjogY2VudGVyIj48YSB0YXJnZXQ9Il9z
ZWxmIiBocmVmPSJodHRwczovL3d3dy5sYW5kY2hpbmEuY29tLyI+PGltZyB0aXRsZT0idGRzY3dfbG9nZTEucG5nIiBhbHQ9InRkc2N3X2xvZ2UxLnBuZyIgc3JjPSJodHRwczovL3d3dy5sYW5kY2hpbmEuY29tL25ld21hbmFnZS91ZWRpdG9yL3V0ZjgtbmV0L25ldC91cGxvYWQvaW1hZ2UvMjAyMDA2MTAvNjM3Mjc0MDYzNDI4NzcxMTA4MTExMTMxMi5wbmciIC8+PC9hPjwvcD48L3RkPjx0ZCB2YWxpZ249InRvcCIgd2lkdGg9IjYyMCIgc3R5bGU9ImJvcmRlci1ib3R0b206IDBweCBzb2xpZDsgYm9yZGVyLWxlZnQ6IDBweCBzb2xpZDsgd29yZC1icmVhazogYnJlYWstYWxsOyBib3JkZXItdG9wOiAwcHggc29saWQ7IGJvcmRlci1yaWdodDogMHB4IHNvbGlkIj48c3BhbiBzdHlsZT0iZm9udC1mYW1pbHk6IOWui+S9kywgU2ltU3VuOyBjb2xvcjogcmdiKDI1NSwyNTUsMjU1KTsgZm9udC1zaXplOiAxMnB4Ij7kuLvlip7vvJroh6rnhLbotYTmupDpg6jkuI3liqjkuqfnmbvorrDkuK3lv4PvvIjoh6rnhLbotYTmupDpg6jms5XlvovkuovliqHkuK3lv4PvvIk8L3NwYW4+PHA+PHNwYW4gc3R5bGU9ImZvbnQtZmFtaWx5OiDlrovkvZMsIFNpbVN1bjsgY29sb3I6IHJnYigyNTUsMjU1LDI1NSk7IGZvbnQtc2l6ZTogMTJweCI+5oyH5a+85Y2V5L2N77ya6Ieq54S26LWE5rqQ6YOo6Ieq54S26LWE5rqQ5byA5Y+R5Yip55So5Y+4Jm5ic3A7ICZuYnNwO+aKgOacr+aUr+aMge+8mua1meaxn+iHu+WWhOenkeaKgOiCoeS7veaciemZkOWFrOWPuDwvc3Bhbj48L3A+PHA+PHNwYW4gc3R5bGU9ImNvbG9yOiAjZmZmZmZmIj48c3BhbiBzdHlsZT0iZm9udC1mYW1pbHk6IOWui+S9kywgU2ltU3VuOyBmb250LXNpemU6IDEycHgiPjxhIGhyZWY9Imh0dHBzOi8vYmVpYW4ubWlpdC5nb3YuY24vIj48c3BhbiBzdHlsZT0iY29sb3I6ICNmZmZmZmYiPuS6rElDUOWkhzEyMDM5NDE05Y+3LTQ8L3NwYW4+PC9hPjwvc3Bhbj48L3NwYW4+PHNwYW4gc3R5bGU9ImZvbnQtZmFtaWx5OiDlrovkvZMsIFNpbVN1bjsgY29sb3I6IHJnYigyNTUsMjU1LDI1NSk7IGZvbnQtc2l6ZTogMTJweCI+Jm5ic3A7Jm5ic3A7Jm5ic3A7PGEgaHJlZj0iaHR0cHM6Ly93d3cuYmVpYW4uZ292LmNuL3BvcnRhbC9yZWdpc3RlclN5c3RlbUluZm8/cmVjb3JkY29kZT0xMTAxMDIwMjAwODk5MCI+PHNwYW4gc3R5bGU9ImNvbG9yOiAjZmZmZmZmIj7kuqzlhaznvZHlronlpIcxMTAxMDIwMjAwODk5MDwvc3Bhbj48L2E+Jm5ic3A7Jm5ic3A7Jm5ic3A76YKu566x77yabGFuZGNoaW5hMjE4QDE2My5jb20mbmJzcDsmbmJzcDs8c2NyaXB0IHR5cGU9InRleHQvamF2YXNjcmlwdCI+dmFyIF9iZGhtUHJvdG9jb2wgPSAoKCJodHRwczoiID09IGRvY3VtZW50LmxvY2F0aW9uLnByb3RvY29sKSA/ICIgaHR0cHM6Ly8iIDogIiBodHRwczovLyIpO2RvY3VtZW50LndyaXRlKHVuZXNjYXBlKCIlM0NzY3JpcHQgc3JjPSciICsgX2JkaG1Qcm90b2NvbCArICJobS5iYWlkdS5jb20vaC5qcyUzRjgzODUz
ODU5YzcyNDdjNWIwM2I1Mjc4OTQ2MjJkM2ZhJyB0eXBlPSd0ZXh0L2phdmFzY3JpcHQnJTNFJTNDL3NjcmlwdCUzRSIpKTs8L3NjcmlwdD48L3NwYW4+PC9wPjwvdGQ+PC90cj48L3Rib2R5PjwvdGFibGU+PHA+Jm5ic3A7PC9wPh8BBWRCQUNLR1JPVU5ELUlNQUdFOnVybChodHRwOi8vd3d3LmxhbmRjaGluYS5jb20vVXNlci9kZWZhdWx0L1VwbG9hZC9zeXNGcmFtZUltZy94X3Rkc2N3MjAxM195d18xLmpwZyk7ZGSROBpN7Ou6S2YtyT/YJE2rnjHfndNLarLWFJhIlQuyjA==",
"__VIEWSTATEGENERATOR": "CA0B0334",
"__EVENTVALIDATION": "/wEdAAISCq2FkCh/InrAaZFxC1vNCeA4P5qp+tM6YGffBqgTjY2TFC6PLXgOad3UkDIJ23GnLFsuDKRNysjMxLxyvjLD",
"hidComName": "default",
"TAB_QueryConditionItem": "42ad98ae-c46a-40aa-aacc-c0884036eeaf",
"TAB_QuerySortItemList": "282:False",
"TAB_QuerySubmitConditionData": '42ad98ae-c46a-40aa-aacc-c0884036eeaf:11▓~北京市'.encode("gbk"),
"TAB_QuerySubmitOrderData": "282:False",
"TAB_RowButtonActionControl":"",
"TAB_QuerySubmitPagerData": "1",
"TAB_QuerySubmitSortData":"",
}
headers = {
# Too many to list, so they are omitted here
}
print(requests.post('https://www.landchina.com/default.aspx?tabid=263', data=data, headers=headers).text)
The data comes back successfully, so the request works.
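The example hard-codes __VIEWSTATE, __VIEWSTATEGENERATOR and __EVENTVALIDATION. Assuming the site behaves like a typical ASP.NET WebForms page, where these are hidden inputs embedded in the HTML and may change over time, they could be read from a fresh GET of the page instead of being pasted in. I have not verified this against the site; the sketch below only shows the extraction step on a made-up HTML snippet, and HiddenInputParser and extract_hidden_fields are illustrative names.

```python
from html.parser import HTMLParser


class HiddenInputParser(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def extract_hidden_fields(html):
    parser = HiddenInputParser()
    parser.feed(html)
    return parser.fields


# Made-up sample standing in for the real page source
sample_html = """
<form>
  <input type="hidden" name="__VIEWSTATE" value="abc123" />
  <input type="hidden" name="__EVENTVALIDATION" value="xyz789" />
  <input type="text" name="q" value="ignored" />
</form>
"""
fields = extract_hidden_fields(sample_html)
print(fields)
```

The resulting dict could then be merged into data before adding the TAB_ parameters.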
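Getting region IDs
For bulk collection across a range of regions, we first need the region names and IDs.
The enumeration list that appears when you click the administrative region field is actually a separate web page.
Link: https://www.landchina.com/ExtendModule/WorkAction/EnumSelectEx.aspx?group=1&n=TAB_queryTblEnumItem_256
The IDs and region names are obtained here.
This is also a POST request.
It is fairly simple, so no code example is given.
That concludes this case.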
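Notes
Land source on the detail page.
Extract the number under the element with id mainModuleContainer_1855_1856_ctl00_ctl00_p1_f1_r2_c4_ctrl and compare it with the area (hectares).
- If the two are equal, it is existing construction land (现有建设用地)
- If the value is 0, it is new construction land (新增建设用地)
- If the value is nonzero and less than the area (hectares), it is new construction land from the stock pool (新增建设用地(来自存量库))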
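The three rules above translate into a small classifier. This is a sketch: classify_land_source is an illustrative name, the checks follow the order in which the rules are listed, a small tolerance is assumed for comparing parsed decimals, and the "unknown" fallback for values larger than the area is my own addition, since the notes do not cover that case.

```python
import math


def classify_land_source(value, area_ha):
    """Map the extracted number and the area (hectares) to a land source label."""
    if math.isclose(value, area_ha, rel_tol=1e-9, abs_tol=1e-9):
        return "existing construction land"
    if math.isclose(value, 0.0, abs_tol=1e-9):
        return "new construction land"
    if 0 < value < area_ha:
        return "new construction land (from stock)"
    return "unknown"  # not covered by the rules above


for value, area in [(2.5, 2.5), (0.0, 2.5), (1.0, 2.5)]:
    print(value, area, classify_land_source(value, area))
```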
Land supply result search
Land supply result search request
import json

import requests
# lxpy is a third-party package; copy_headers_dict turns a pasted header block into a dict
from lxpy import copy_headers_dict

# Request headers copied from the browser's DevTools
payloadHeader2 = copy_headers_dict(
'''
accept: application/json, text/plain, */*
accept-encoding: gzip, deflate, br
accept-language: zh-CN,zh;q=0.9
cache-control: no-cache
content-type: application/json;charset=UTF-8
origin: https://landchina.com
pragma: no-cache
referer: https://landchina.com/
sec-ch-ua: "Chromium";v="92", " Not A;Brand";v="99", "Google Chrome";v="92"
sec-ch-ua-mobile: ?0
sec-fetch-dest: empty
sec-fetch-mode: cors
sec-fetch-site: same-site
user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36
'''
)
payload = {"pageNum":1,"pageSize":10,"xzqDm":"11","gyFs":"","tdYt":"","startDate":"","endDate":"","dzBaBh":"","tdZl":""}
# Search page: https://landchina.com/resultNotice (the request itself goes to the API host)
url = "https://api.landchina.com/tGdxm/result/list"
# The body is sent as a JSON string, matching the content-type header above
data = json.dumps(payload)
lb_list = requests.post(url, data=data, headers=payloadHeader2)
print(lb_list.text)
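For collecting more than one page, the JSON body can be generated per page. This is a sketch under the assumption, inferred only from the field names, that pageNum is the 1-based page index and xzqDm is the administrative region code; build_search_payload is an illustrative helper, and the other filter fields are left empty as in the request above.

```python
import json


def build_search_payload(page_num, region_code, page_size=10):
    """Build the JSON payload for one page of search results."""
    return {
        "pageNum": page_num,
        "pageSize": page_size,
        "xzqDm": region_code,
        "gyFs": "",
        "tdYt": "",
        "startDate": "",
        "endDate": "",
        "dzBaBh": "",
        "tdZl": "",
    }


# Bodies for the first three pages of region code "11"
bodies = [json.dumps(build_search_payload(page, "11")) for page in range(1, 4)]
for body in bodies:
    print(body)
```

Each body can then be passed as data to the same requests.post call.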
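Hash value
I did not look into this myself; what follows is relayed from a screenshot shared by a group member.
Search for the keyword hash and it turns up in the first JS file.
The string shown in the screenshot is hashed with SHA-256.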
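The SHA-256 step itself is a one-liner in Python. The actual input string was only visible in the screenshot and is not reproduced here, so the sketch below uses "abc" as a stand-in; sha256_hex is an illustrative name, and UTF-8 encoding plus hex output are assumptions about what the site's JS does.

```python
import hashlib


def sha256_hex(text):
    """Return the SHA-256 digest of text as a lowercase hex string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# "abc" is a placeholder, not the string the site actually hashes
digest = sha256_hex("abc")
print(digest)
```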