一次写爬虫碰到的问题

fork了lanbin510的豆瓣爬虫,想改写下写数据库。 这里有2个问题,一个是作者原先是基于python2.7的,我是直接用的3.6.9,所以有些语法上不大一样。
https://github.com/timoseven/DouBanSpider

因为要用mysql,直接docker来一个就行了,结果发现默认装的都8.0了,我还在用5.6呢

1
sudo docker run --name bookdb -p 3306:3306 -e MYSQL\_ROOT\_PASSWORD=123456 -d mysql

至于8.0增加了什么东西,可以先参考下这个。但是这个不是我碰到的问题。
#https://www.jianshu.com/p/be29467c2b0c

开始建表了这就。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
CREATE TABLE IF NOT EXISTS `bookinfo`(
`id` INT UNSIGNED AUTO_INCREMENT,
`book_title` VARCHAR(100) NOT NULL,
`book_douban_id` INT UNSIGNED NOT NULL,
`book_rate` FLOAT(3,2) NOT NULL,
`book_author` VARCHAR(100) NOT NULL,
`book_rate_user` INT UNSIGNED NOT NULL,
`book_press` VARCHAR(100) NOT NULL,
`book_press_date` VARCHAR(20) NOT NULL,
`book_price` FLOAT(6,2) NOT NULL,
`book_tag` VARCHAR(100) NOT NULL,
`createtime` DATE DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY ( `id` )
)ENGINE=InnoDB DEFAULT CHARSET=utf8;

mysql> show warnings;
+---------+------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Level | Code | Message |
+---------+------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Warning | 1681 | Specifying number of digits for floating point data types is deprecated and will be removed in a future release. |
| Warning | 1681 | Specifying number of digits for floating point data types is deprecated and will be removed in a future release. |
| Warning | 3719 | 'utf8' is currently an alias for the character set UTF8MB3, but will be an alias for UTF8MB4 in a future release. Please consider using UTF8MB4 in order to be unambiguous. |
+---------+------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

我去,建一个表居然出来3个warning,这也是牛逼大发了,不过还好都是关于未来的,而不是现在的。

不过接下去这个问题就不是经验可以覆盖的,必须google,发现创建新用户一直都是报错。

1
2
mysql> GRANT ALL PRIVILEGES ON douban_book.* TO 'douban'@'127.0.0.1'IDENTIFIED BY '123456';
ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'IDENTIFIED BY '123456'' at line 1

这都没开始flush呢,还居然报错了,实在看不出来是什么问题。只好放狗搜索了下。原来现在只能是先create user再进行grant。

1
2
CREATE USER 'douban'@'127.0.0.1' IDENTIFIED BY '123456';
GRANT ALL ON *.* TO 'root'@'${MYSQL_ROOT_HOST}' WITH GRANT OPTION;

好了,数据库的解决好了就看看python3下的mysql的驱动有哪些了,发现pymysql最合适,那就装上了,可再实际跑的时候一直各种报错啊。

就很简单的几条调用还居然报错了

1
2
3
4
5
6
7
8
9
10
11
12
13
14
Traceback (most recent call last):
File "doubanSpider.py", line 220, in <module>
do_spider(book_tag_lists)
File "doubanSpider.py", line 168, in do_spider
'INSERT INTO bookinfo (book_title, book_douban_id,book_rate,book_author,book_rate_user,book_press,book_press_date,book_price,book_tag) VALUES ("%s", %d, %f, "%s", %d, "%s", "%s", %f, "%s")', (book_list[0],))
File "/Users/timo/.pyenv/versions/venv369/lib/python3.6/site-packages/pymysql/cursors.py", line 199, in executemany
self.rowcount = sum(self.execute(query, arg) for arg in args)
File "/Users/timo/.pyenv/versions/venv369/lib/python3.6/site-packages/pymysql/cursors.py", line 199, in <genexpr>
self.rowcount = sum(self.execute(query, arg) for arg in args)
File "/Users/timo/.pyenv/versions/venv369/lib/python3.6/site-packages/pymysql/cursors.py", line 168, in execute
query = self.mogrify(query, args)
File "/Users/timo/.pyenv/versions/venv369/lib/python3.6/site-packages/pymysql/cursors.py", line 147, in mogrify
query = query % self._escape_args(args, conn)
TypeError: %d format: a number is required, not str

我数据表里有int类型啊,这里也写了%d了,可为什么一直报错呢。只能继续放狗搜索下,发现居然使用pymysql的时候,所有字段不管什么类型都是使用%s
https://blog.csdn.net/jy1690229913/article/details/79407224

然后小白小朋友还特地去看看为什么会这样
https://github.com/PyMySQL/PyMySQL/blob/master/pymysql/converters.py
就是这个文件的第324行和53,54行说明了这个问题,居然这个里面把所有的都变成了string了,可你们的帮助文档里压根什么都不说明。

help(pymysql.cursors)里面就只写了

1
2
3
4
5
6
7
8
9
10
|  executemany(self, query, args)
| Run several data against one query
|
| :param query: query to execute on server
| :param args: Sequence of sequences or mappings. It is used as parameter.
| :return: Number of rows affected, if any.
|
| This method improves performance on multiple-row INSERT and
| REPLACE. Otherwise it is equivalent to looping over args with
| execute().

这种帮助就跟没有我看也没什么区别啊。