China Land Market Network (中国土地市场网) crawler case study

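This case study covers two requests: the land market list page and the supply-result search page. The site is not hard to scrape, but a few details are worth learning from.

Link: https://www.landchina.com/default.aspx?tabid=263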


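Packet capture analysis

POST endpoint: https://www.landchina.com/default.aspx?tabid=263

The request headers contain no dynamic parameters.

The form data does not appear to contain dynamic parameters either. One parameter, however, is displayed as (unable to decode value).

In other words, once we work out the undisplayable TAB_QuerySubmitConditionData parameter, we can construct the request.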


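Locating the parameter

The parameter to locate: TAB_QuerySubmitConditionData

The Initiator column shows no call stack, which indicates that the request is issued by the page itself rather than by script code.
A global search for TAB_QuerySubmitConditionData finds it only in the HTML file, so the value is read from the page and submitted with the form.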

Inspect the element.

Here the value is 42ad98ae-c46a-40aa-aacc-c0884036eeaf:110101▓~东城区

  • 42ad98ae-c46a-40aa-aacc-c0884036eeaf is a fixed value
  • 110101 is a region ID
  • ▓ is a special character
  • 东城区 (Dongcheng District) is the search term

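Worried that the copied character might get mangled, I looked up (on Baidu) how to type it.

Hold down the Alt key, type 43144 on the numeric keypad on the right side of the keyboard, then release Alt to produce ▓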
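
To avoid typing or pasting the character at all, it can also be written as a Unicode escape. The sketch below assumes the value layout described above (fixed ID, colon, region ID, ▓, tilde, search term); the helper name build_condition is hypothetical and not something the site defines.

```python
# Fixed ID observed in the page's hidden field.
CONDITION_ID = "42ad98ae-c46a-40aa-aacc-c0884036eeaf"
# U+2593 is the ▓ separator; an escape sidesteps copy/paste problems.
SEPARATOR = "\u2593"


def build_condition(region_id: str, region_name: str) -> str:
    """Hypothetical helper: assemble the TAB_QuerySubmitConditionData value."""
    return f"{CONDITION_ID}:{region_id}{SEPARATOR}~{region_name}"


value = build_condition("110101", "东城区")
print(value)

# The Alt code 43144 is 0xA888, i.e. the two GBK bytes of the separator.
print(SEPARATOR.encode("gbk").hex())
```

The Alt code and the GBK bytes being the same number is a useful sanity check that the right character was entered.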


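Simulating the request

With the parameters settled, we can simulate the POST request.

The form data contains (unable to decode value), which is caused by the encoding of the site's pages,

so this parameter has to be converted to that encoding when the request is sent: encode("gbk").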
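
As a rough illustration of why the value cannot be displayed: the form body carries GBK bytes, and those bytes are not valid UTF-8. The sketch uses only the standard library, and the variable names are illustrative.

```python
from urllib.parse import urlencode

value = "42ad98ae-c46a-40aa-aacc-c0884036eeaf:11\u2593~北京市"

# Bytes are percent-encoded as-is, which is also how requests treats
# a bytes value inside the `data` dict.
gbk_body = urlencode({"TAB_QuerySubmitConditionData": value.encode("gbk")})
# A str value is encoded as UTF-8 by default, giving a different body.
utf8_body = urlencode({"TAB_QuerySubmitConditionData": value})

print(gbk_body)
print(utf8_body)

# The GBK bytes cannot be read back as UTF-8.
try:
    value.encode("gbk").decode("utf-8")
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False
print(decodes_as_utf8)
```

Comparing the two printed bodies shows that sending the plain string would put UTF-8 bytes on the wire, which is not what the page itself submits.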

Request example:

import requests

data = {
"__VIEWSTATE": "/wEPDwUJNjkzNzgyNTU4D2QWAmYPZBYIZg9kFgICAQ9kFgJmDxYCHgNzcmMFN1VzZXIvZGVmYXVsdC9VcGxvYWQvc3lzRnJhbWVJbWcveF90ZHNjdzIwMjBfZmxhc2hfMS5wbmdkAgEPZBYCAgEPFgIeBXN0eWxlBSBCQUNLR1JPVU5ELUNPTE9SOiNmM2Y1Zjc7Q09MT1I6O2QCAg9kFgICAQ9kFgJmD2QWAmYPZBYCZg9kFgRmD2QWAmYPZBYCZg9kFgJmD2QWAmYPZBYCZg8WBB8BBSBDT0xPUjojRDNEM0QzO0JBQ0tHUk9VTkQtQ09MT1I6Ox4HVmlzaWJsZWgWAmYPZBYCAgEPZBYCZg8PFgIeBFRleHRlZGQCAQ9kFgJmD2QWAmYPZBYCZg9kFgRmD2QWAmYPFgQfAQWHAUNPTE9SOiNEM0QzRDM7QkFDS0dST1VORC1DT0xPUjo7QkFDS0dST1VORC1JTUFHRTp1cmwoaHR0cDovL3d3dy5sYW5kY2hpbmEuY29tL1VzZXIvZGVmYXVsdC9VcGxvYWQvc3lzRnJhbWVJbWcveF90ZHNjd19zeV9qaGdnXzAwMC5naWYpOx4GaGVpZ2h0BQEzFgJmD2QWAgIBD2QWAmYPDxYCHwNlZGQCAg9kFgJmD2QWAmYPZBYCZg9kFgJmD2QWAmYPZBYCZg9kFgRmD2QWAmYPFgQfAQUgQ09MT1I6I0QzRDNEMztCQUNLR1JPVU5ELUNPTE9SOjsfAmgWAmYPZBYCAgEPZBYCZg8PFgIfA2VkZAICD2QWAmYPZBYEZg9kFgJmD2QWAmYPZBYCZg9kFgJmD2QWAmYPZBYCZg8WBB8BBSBDT0xPUjojRDNEM0QzO0JBQ0tHUk9VTkQtQ09MT1I6Ox8CaBYCZg9kFgICAQ9kFgJmDw8WAh8DZWRkAgIPZBYEZg9kFgJmD2QWAmYPZBYCZg9kFgICAQ9kFgJmDxYEHwEFhgFDT0xPUjojRDNEM0QzO0JBQ0tHUk9VTkQtQ09MT1I6O0JBQ0tHUk9VTkQtSU1BR0U6dXJsKGh0dHA6Ly93d3cubGFuZGNoaW5hLmNvbS9Vc2VyL2RlZmF1bHQvVXBsb2FkL3N5c0ZyYW1lSW1nL3hfdGRzY3dfenlfamdnZ18wMS5naWYpOx8EBQI0NhYCZg9kFgICAQ9kFgJmDw8WAh8DZWRkAgEPZBYCZg9kFgJmD2QWAmYPZBYCAgEPZBYCZg8WBB8BBSBDT0xPUjojRDNEM0QzO0JBQ0tHUk9VTkQtQ09MT1I6Ox8CaBYCZg9kFgICAQ9kFgJmDw8WAh8DZWRkAgMPZBYCAgMPFgQeCWlubmVyaHRtbAWuEDxwPjxzdHlsZSB0eXBlPSJ0ZXh0L2NzcyI+QTpsaW5rIHsgQ09MT1I6IzAwMDAwMDsgVEVYVC1ERUNPUkFUSU9OOk5vbmV9QTp2aXNpdGVkIHsgIENPTE9SOiMwMDAwMDA7IFRFWFQtREVDT1JBVElPTjpOb25lfUE6YWN0aXZlIHsgICAgICBDT0xPUjojMDAwMDAwOyBURVhULURFQ09SQVRJT046Tm9uZX1BOmhvdmVyIHsgICAgQ09MT1I6IzAwOTlGRjsgVEVYVC1ERUNPUkFUSU9OOk5vbmV9PC9zdHlsZT48L3A+PHA+PGJyIC8+Jm5ic3A7PC9wPjx0YWJsZT48dGJvZHk+PHRyIGNsYXNzPSJmaXJzdFJvdyI+PHRkIHZhbGlnbj0idG9wIiB3aWR0aD0iMzcwIiBzdHlsZT0iYm9yZGVyLWJvdHRvbTogMHB4IHNvbGlkOyBib3JkZXItbGVmdDogMHB4IHNvbGlkOyBib3JkZXItdG9wOiAwcHggc29saWQ7IGJvcmRlci1yaWdodDogMHB4IHNvbGlkIj48cCBzdHlsZT0idGV4dC1hbGlnbjogY2VudGVyIj48YSB0YXJnZXQ9Il9z
ZWxmIiBocmVmPSJodHRwczovL3d3dy5sYW5kY2hpbmEuY29tLyI+PGltZyB0aXRsZT0idGRzY3dfbG9nZTEucG5nIiBhbHQ9InRkc2N3X2xvZ2UxLnBuZyIgc3JjPSJodHRwczovL3d3dy5sYW5kY2hpbmEuY29tL25ld21hbmFnZS91ZWRpdG9yL3V0ZjgtbmV0L25ldC91cGxvYWQvaW1hZ2UvMjAyMDA2MTAvNjM3Mjc0MDYzNDI4NzcxMTA4MTExMTMxMi5wbmciIC8+PC9hPjwvcD48L3RkPjx0ZCB2YWxpZ249InRvcCIgd2lkdGg9IjYyMCIgc3R5bGU9ImJvcmRlci1ib3R0b206IDBweCBzb2xpZDsgYm9yZGVyLWxlZnQ6IDBweCBzb2xpZDsgd29yZC1icmVhazogYnJlYWstYWxsOyBib3JkZXItdG9wOiAwcHggc29saWQ7IGJvcmRlci1yaWdodDogMHB4IHNvbGlkIj48c3BhbiBzdHlsZT0iZm9udC1mYW1pbHk6IOWui+S9kywgU2ltU3VuOyBjb2xvcjogcmdiKDI1NSwyNTUsMjU1KTsgZm9udC1zaXplOiAxMnB4Ij7kuLvlip7vvJroh6rnhLbotYTmupDpg6jkuI3liqjkuqfnmbvorrDkuK3lv4PvvIjoh6rnhLbotYTmupDpg6jms5XlvovkuovliqHkuK3lv4PvvIk8L3NwYW4+PHA+PHNwYW4gc3R5bGU9ImZvbnQtZmFtaWx5OiDlrovkvZMsIFNpbVN1bjsgY29sb3I6IHJnYigyNTUsMjU1LDI1NSk7IGZvbnQtc2l6ZTogMTJweCI+5oyH5a+85Y2V5L2N77ya6Ieq54S26LWE5rqQ6YOo6Ieq54S26LWE5rqQ5byA5Y+R5Yip55So5Y+4Jm5ic3A7ICZuYnNwO+aKgOacr+aUr+aMge+8mua1meaxn+iHu+WWhOenkeaKgOiCoeS7veaciemZkOWFrOWPuDwvc3Bhbj48L3A+PHA+PHNwYW4gc3R5bGU9ImNvbG9yOiAjZmZmZmZmIj48c3BhbiBzdHlsZT0iZm9udC1mYW1pbHk6IOWui+S9kywgU2ltU3VuOyBmb250LXNpemU6IDEycHgiPjxhIGhyZWY9Imh0dHBzOi8vYmVpYW4ubWlpdC5nb3YuY24vIj48c3BhbiBzdHlsZT0iY29sb3I6ICNmZmZmZmYiPuS6rElDUOWkhzEyMDM5NDE05Y+3LTQ8L3NwYW4+PC9hPjwvc3Bhbj48L3NwYW4+PHNwYW4gc3R5bGU9ImZvbnQtZmFtaWx5OiDlrovkvZMsIFNpbVN1bjsgY29sb3I6IHJnYigyNTUsMjU1LDI1NSk7IGZvbnQtc2l6ZTogMTJweCI+Jm5ic3A7Jm5ic3A7Jm5ic3A7PGEgaHJlZj0iaHR0cHM6Ly93d3cuYmVpYW4uZ292LmNuL3BvcnRhbC9yZWdpc3RlclN5c3RlbUluZm8/cmVjb3JkY29kZT0xMTAxMDIwMjAwODk5MCI+PHNwYW4gc3R5bGU9ImNvbG9yOiAjZmZmZmZmIj7kuqzlhaznvZHlronlpIcxMTAxMDIwMjAwODk5MDwvc3Bhbj48L2E+Jm5ic3A7Jm5ic3A7Jm5ic3A76YKu566x77yabGFuZGNoaW5hMjE4QDE2My5jb20mbmJzcDsmbmJzcDs8c2NyaXB0IHR5cGU9InRleHQvamF2YXNjcmlwdCI+dmFyIF9iZGhtUHJvdG9jb2wgPSAoKCJodHRwczoiID09IGRvY3VtZW50LmxvY2F0aW9uLnByb3RvY29sKSA/ICIgaHR0cHM6Ly8iIDogIiBodHRwczovLyIpO2RvY3VtZW50LndyaXRlKHVuZXNjYXBlKCIlM0NzY3JpcHQgc3JjPSciICsgX2JkaG1Qcm90b2NvbCArICJobS5iYWlkdS5jb20vaC5qcyUzRjgzODUz
ODU5YzcyNDdjNWIwM2I1Mjc4OTQ2MjJkM2ZhJyB0eXBlPSd0ZXh0L2phdmFzY3JpcHQnJTNFJTNDL3NjcmlwdCUzRSIpKTs8L3NjcmlwdD48L3NwYW4+PC9wPjwvdGQ+PC90cj48L3Rib2R5PjwvdGFibGU+PHA+Jm5ic3A7PC9wPh8BBWRCQUNLR1JPVU5ELUlNQUdFOnVybChodHRwOi8vd3d3LmxhbmRjaGluYS5jb20vVXNlci9kZWZhdWx0L1VwbG9hZC9zeXNGcmFtZUltZy94X3Rkc2N3MjAxM195d18xLmpwZyk7ZGSROBpN7Ou6S2YtyT/YJE2rnjHfndNLarLWFJhIlQuyjA==",
"__VIEWSTATEGENERATOR": "CA0B0334",
"__EVENTVALIDATION": "/wEdAAISCq2FkCh/InrAaZFxC1vNCeA4P5qp+tM6YGffBqgTjY2TFC6PLXgOad3UkDIJ23GnLFsuDKRNysjMxLxyvjLD",
"hidComName": "default",
"TAB_QueryConditionItem": "42ad98ae-c46a-40aa-aacc-c0884036eeaf",
"TAB_QuerySortItemList": "282:False",
"TAB_QuerySubmitConditionData": '42ad98ae-c46a-40aa-aacc-c0884036eeaf:11▓~北京市'.encode("gbk"),
"TAB_QuerySubmitOrderData": "282:False",
"TAB_RowButtonActionControl":"",
"TAB_QuerySubmitPagerData": "1",
"TAB_QuerySubmitSortData":"",
}

headers = {
# Too many to list here; omitted
}
url = 'https://www.landchina.com/default.aspx?tabid=263'
response = requests.post(url, data=data, headers=headers)
print(response.text)

The data comes back successfully, so the request works.
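
To collect more than the first page, one plausible approach is to resend the same form with a different pager value. This is only a sketch under the assumption that TAB_QuerySubmitPagerData carries the page number (its name and its value of "1" suggest so, but this has not been verified here). form_for_page is a hypothetical helper, and base_form stands in for the full data dict from the request example.

```python
def form_for_page(base_form: dict, page: int) -> dict:
    """Return a copy of the form with the (assumed) page-number field changed."""
    form = dict(base_form)  # shallow copy so the original dict stays untouched
    form["TAB_QuerySubmitPagerData"] = str(page)
    return form


# Stand-in for the full `data` dict from the request example.
base_form = {"TAB_QuerySubmitPagerData": "1", "hidComName": "default"}
forms = [form_for_page(base_form, page) for page in range(1, 4)]
pages = [form["TAB_QuerySubmitPagerData"] for form in forms]
print(pages)
```

Each resulting dict can be passed as data to requests.post in the same way as the single-page example.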


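Getting region IDs

For batch collection across regions, first obtain the region names and IDs.

The enumeration list that appears when you click the administrative region field is actually a web page of its own.
Link: https://www.landchina.com/ExtendModule/WorkAction/EnumSelectEx.aspx?group=1&n=TAB_queryTblEnumItem_256

The IDs and region names are fetched from here.
This is also a POST request.
This part is fairly simple, so no code example is given.

This concludes the case.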


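Notes

Land source on the detail page.

Extract the number under the element with id mainModuleContainer_1855_1856_ctl00_ctl00_p1_f1_r2_c4_ctrl and compare it with the area (hectares).

  • If the two are equal, it is existing construction land (现有建设用地)
  • If the value is 0, it is new construction land (新增建设用地)
  • If the value is not 0 and is less than the area (hectares), it is new construction land from the stock inventory (新增建设用地(来自存量库))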
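
The three rules above can be sketched as a small function. The function name, the English labels, and the tolerance-based equality check are illustrative choices; a value larger than the area is not covered by the rules, so the sketch returns None in that case.

```python
import math
from typing import Optional


def classify_land_source(source_value: float, area_ha: float) -> Optional[str]:
    """Apply the three rules in order; return None when no rule matches."""
    # Tolerance-based equality is an assumption, to absorb float rounding.
    if math.isclose(source_value, area_ha, abs_tol=1e-9):
        return "existing construction land"
    if source_value == 0:
        return "new construction land"
    if 0 < source_value < area_ha:
        return "new construction land (from the stock inventory)"
    return None  # e.g. value larger than the area: not covered by the rules


print(classify_land_source(2.5, 2.5))
print(classify_land_source(0, 2.5))
print(classify_land_source(1.0, 2.5))
```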

Supply-result search

Supply-result search request

import json

import requests
from lxpy import copy_headers_dict

# Headers copied from the browser; copy_headers_dict turns the raw text into a dict
payloadHeader2 = copy_headers_dict(
'''
accept: application/json, text/plain, */*
accept-encoding: gzip, deflate, br
accept-language: zh-CN,zh;q=0.9
cache-control: no-cache
content-type: application/json;charset=UTF-8
origin: https://landchina.com
pragma: no-cache
referer: https://landchina.com/
sec-ch-ua: "Chromium";v="92", " Not A;Brand";v="99", "Google Chrome";v="92"
sec-ch-ua-mobile: ?0
sec-fetch-dest: empty
sec-fetch-mode: cors
sec-fetch-site: same-site
user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36
'''
)
# Search conditions; xzqDm appears to be the region code ("11" was Beijing in the earlier form)
payload = {"pageNum":1,"pageSize":10,"xzqDm":"11","gyFs":"","tdYt":"","startDate":"","endDate":"","dzBaBh":"","tdZl":""}
# Not used below; the request goes to the API URL
url= "https://landchina.com/resultNotice"
# The API expects a JSON body, so serialize the payload first
data = json.dumps(payload)
lb_list = requests.post('https://api.landchina.com/tGdxm/result/list',data=data,headers=payloadHeader2)
print(lb_list.text)
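
For collecting several pages or regions, the JSON body can be generated rather than hard-coded. This is a minimal sketch that assumes pageNum selects the page and xzqDm selects the region, which the field names suggest but which is not confirmed here. build_payload is a hypothetical helper, and the remaining filter fields are left empty as in the request above.

```python
import json


def build_payload(page_num: int, region_code: str, page_size: int = 10) -> str:
    """Hypothetical helper: serialize the search conditions for one page."""
    payload = {
        "pageNum": page_num,
        "pageSize": page_size,
        "xzqDm": region_code,
        # Remaining filters are left empty, as in the original request.
        "gyFs": "", "tdYt": "", "startDate": "", "endDate": "",
        "dzBaBh": "", "tdZl": "",
    }
    return json.dumps(payload)


bodies = [build_payload(page, "11") for page in range(1, 4)]
print(bodies[0])
```

Each string in bodies can be sent as the data argument of the POST request shown above.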

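hash value

I have not looked into this myself; what follows is relayed from screenshots shared by a group member.

Search for the keyword hash; it turns up in the first JS file.
The string built there is then hashed with SHA-256.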
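
The exact string that gets hashed appears only in the screenshots, so it is not reproduced here. As a sketch of the hashing step alone, assuming the input is UTF-8 encoded and the result is used as a hex digest (both are assumptions), with "abc" as a placeholder input:

```python
import hashlib


def sha256_hex(text: str) -> str:
    """Return the SHA-256 hex digest of a string (UTF-8 encoding assumed)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# "abc" is a placeholder, not the string the site actually hashes.
digest = sha256_hex("abc")
print(digest)
```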