Scraping Maoyan Movie Reviews with Scrapy

Published: 2022/1/17 12:49:34

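Contents

  • Scraping Maoyan Movie Reviews with Scrapy
    • 1. Finding the comment API
    • 2. Analyzing the API URL
      • URL pattern
      • Constructing the URL
      • Analyzing the JSON fields
    • 3. Scrapy code
      • The spider file
      • The Item file
      • Pipelines
      • The settings file
    • 4. Crawl results
    • 5. Scrapy-Redis
      • Modifying the spider file
      • Modifying the settings file
      • Distributed deployment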

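Target: address

1. Finding the comment API

Switch the browser's device mode from PC to mobile. The comment requests then show up in the network panel.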


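2. Analyzing the API URL

First URL: http://m.maoyan.com/mmdb/comments/movie/1216446.json?_v_=yes&offset=0&startTime=0
Second URL: http://m.maoyan.com/mmdb/comments/movie/1216446.json?_v_=yes&offset=15&startTime=2018-10-11%2015%3A19%3A05
Third URL: http://m.maoyan.com/mmdb/comments/movie/1216446.json?_v_=yes&offset=30&startTime=2018-10-11%2015%3A19%3A05

URL pattern

When offset=0, startTime is also 0. After that, offset increases by 15 on each request and startTime becomes a fixed timestamp.

Look at the first regular (non-hot) comment: the page shows it as posted 3 minutes ago, the startTime is 2018-10-11 15:19:05, and my computer's clock read 15:22:04. So startTime is the time of the newest comment.

Constructing the URL

  • 1216446: the movie ID
  • offset: the paging offset
  • startTime: the time of the newest comment

We take the time of the newest comment as a fixed value and increase offset by 15 on each request. That is all it takes to construct the request.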
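
The construction described above can be sketched in a few lines of Python. The function name build_comment_url and its parameters are illustrative assumptions; only the URL pattern comes from the three requests observed earlier.

```python
from urllib.parse import quote

# URL pattern taken from the three requests observed above.
BASE_URL = ('http://m.maoyan.com/mmdb/comments/movie/{}.json'
            '?_v_=yes&offset={}&startTime={}')


def build_comment_url(movie_id, offset, start_time):
    """Build one comment-API URL (hypothetical helper).

    start_time is either 0 (first page) or a 'YYYY-MM-DD HH:MM:SS' string.
    """
    # quote() percent-encodes the space and the colons in the timestamp.
    return BASE_URL.format(movie_id, offset, quote(str(start_time)))


if __name__ == '__main__':
    # Keep startTime fixed and advance offset by 15 per page.
    for offset in (15, 30):
        print(build_comment_url(1216446, offset, '2018-10-11 15:19:05'))
```

With the timestamp from the example, the two printed URLs should match the second and third URLs listed above.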

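Analyzing the JSON fields

  • cmts: regular comments, 15 per request, because offset advances by 15.
  • hcmts: hot comments, 10 of them
  • total: total number of comments

A single comment object looks like this:
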
{
"approve": 3913,
"approved": false,
"assistAwardInfo": {
"avatar": "",
"celebrityId": 0,
"celebrityName": "",
"rank": 0,
"title": ""
},
"authInfo": "",
"avatarurl": "https://img.meituan.net/avatar/7e9e9348115c451276afffda986929b311657.jpg",
"cityName": "深圳",
"content": "脑洞很大,有创意,笑点十足又有泪点,十分感动,十分推荐。怀着看喜剧电影去看的,最后哭了个稀里哗。确实值得一看,很多场景让我回忆青春,片尾的旧照片更是让我想起了小时候。",
"filmView": false,
"gender": 1,
"id": 1035829945,
"isMajor": false,
"juryLevel": 0,
"majorType": 0,
"movieId": 1216446,
"nick": "lxz367738371",
"nickName": "发白的牛仔裤",
"oppose": 0,
"pro": false,
"reply": 94,
"score": 5,
"spoiler": 0,
"startTime": "2018-08-17 03:30:37",
"supportComment": true,
"supportLike": true,
"sureViewed": 0,
"tagList": {},
"time": "2018-08-17 03:30",
"userId": 1326662323,
"userLevel": 2,
"videoDuration": 0,
"vipInfo": "",
"vipType": 0
},
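
The fields we keep (the JSON keys are camelCase, e.g. cityName, nickName, userLevel):

  • cityname: the commenter's city
  • content: the comment text
  • gender: gender
  • id: the ID of the comment record (the commenter's own ID is in userId)
  • nickname: the commenter's nickname
  • userlevel: the commenter's Maoyan user level
  • score: rating (out of 5)
  • time: time of the comment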
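
To make the mapping between the JSON keys and the field names concrete, here is a minimal sketch that pulls those fields out of a response body. FIELD_MAP and extract_comments are hypothetical names introduced for illustration, and the sample values are copied from the comment object shown earlier (with the content shortened).

```python
import json

# Item field name -> key in the API's JSON (names taken from the list above).
FIELD_MAP = {
    'id': 'id',
    'nickname': 'nickName',
    'gender': 'gender',
    'cityname': 'cityName',
    'content': 'content',
    'score': 'score',
    'time': 'time',
    'userlevel': 'userLevel',
}


def extract_comments(body):
    """Parse one response body into a list of plain dicts (hypothetical helper)."""
    data = json.loads(body)
    # Fall back to an empty list in case 'cmts' is missing or null.
    return [
        {field: cmt.get(key) for field, key in FIELD_MAP.items()}
        for cmt in data.get('cmts') or []
    ]


if __name__ == '__main__':
    sample = json.dumps({'cmts': [{
        'id': 1035829945, 'nickName': '发白的牛仔裤', 'gender': 1,
        'cityName': '深圳', 'content': '脑洞很大,有创意', 'score': 5,
        'time': '2018-08-17 03:30', 'userLevel': 2,
    }]})
    for row in extract_comments(sample):
        print(row)
```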

3. Scrapy code

The spider file

Build the initial request URL:

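import json
import re
from datetime import datetime
from urllib.parse import quote
import scrapy
from scrapy import Request
from maoyan.items import MaoyanItem
from maoyan.settings import MOVIE_ID

class Movie1Spider(scrapy.Spider):
    name = 'movie1'
    allowed_domains = ['m.maoyan.com']
    base_url = 'http://m.maoyan.com/mmdb/comments/movie/{}.json?_v_=yes&offset={}&startTime={}'

    def start_requests(self):
        # Start from the current time and walk backwards through the comments.
        time_now = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
        url = self.base_url.format(MOVIE_ID, 0, quote(time_now))
        yield Request(url=url)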

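Next, extract the fields from the JSON data.
A single fixed startTime cannot be used for the whole crawl: with one fixed time the API only returns data up to offset=1005, and nothing beyond that. So each new request uses the time of the last comment received as its startTime. Also, once the crawl has reached the very first comment, requesting further back returns comments from around the movie's release date.
The stop condition is therefore: a comment's time is later than the startTime in the request URL.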

    def parse(self, response):
        last_time = re.search(r'startTime=(.*)', response.url).group(1)  # startTime from the URL (still URL-encoded)
        data = json.loads(response.text)
        cmts = data.get('cmts') or []  # guard against a missing or empty list
        cmt_time = None
        for cmt in cmts:
            maoyan_item = MaoyanItem()
            maoyan_item['id'] = cmt.get('id')
            maoyan_item['nickname'] = cmt.get('nickName')
            maoyan_item['gender'] = cmt.get('gender')
            maoyan_item['cityname'] = cmt.get('cityName')
            maoyan_item['content'] = cmt.get('content')
            maoyan_item['score'] = cmt.get('score')
            cmt_time = cmt.get('startTime')
            maoyan_item['time'] = cmt_time
            maoyan_item['userlevel'] = cmt.get('userLevel')
            if quote(cmt_time) > last_time:  # the comment is newer than the URL time: stop
                break
            yield maoyan_item
        if cmt_time and quote(cmt_time) < last_time:  # the last comment is older than the URL time
            url = self.base_url.format(MOVIE_ID, 15, quote(cmt_time))   # continue from the last comment's time
            yield Request(url=url, meta={'next_time': cmt_time})
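
The parse method compares URL-encoded strings, which works because both sides share the same fixed-width timestamp format. A more explicit alternative, offered here as a sketch and not as the author's method, decodes the URL value and compares datetime objects. The helper name is hypothetical, and it assumes startTime in the URL is always a real timestamp (as it is in start_requests above), never 0.

```python
from datetime import datetime
from urllib.parse import unquote

TIME_FORMAT = '%Y-%m-%d %H:%M:%S'


def is_newer_than_request(comment_time, url_start_time):
    """Return True if a comment is newer than the startTime in the URL.

    comment_time: plain string from the JSON, e.g. '2018-08-17 03:30:37'.
    url_start_time: percent-encoded value captured from the request URL.
    """
    requested = datetime.strptime(unquote(url_start_time), TIME_FORMAT)
    posted = datetime.strptime(comment_time, TIME_FORMAT)
    return posted > requested
```

A True result corresponds to the break in the loop: the API has wrapped around to comments that are newer than the requested time.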

The Item file

import scrapy
from scrapy import Field

class MaoyanItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    table = 'movie'  # MySQL table name used by the pipeline
    id = Field()  # ID
    nickname = Field()  # nickname
    gender = Field()  # gender
    cityname = Field()  # city name
    content = Field()  # comment text
    score = Field()  # rating
    time = Field()  # comment time
    userlevel = Field()  # commenter's user level

Pipelines

Save the items to a MySQL database:

import pymysql


class MaoyanPipeline(object):
    def __init__(self, host, databases, user, password, port):
        self.host = host
        self.databases = databases
        self.user = user
        self.password = password
        self.port = port

    @classmethod
    def from_crawler(cls, crawler):
        # Read the connection details from settings.py.
        return cls(
            host=crawler.settings.get('MYSQL_HOST'),
            databases=crawler.settings.get('MYSQL_DATABASES'),
            user=crawler.settings.get('MYSQL_USER'),
            password=crawler.settings.get('MYSQL_PASSWORD'),
            port=crawler.settings.get('MYSQL_PORT'),
        )

    def open_spider(self, spider):
        # Keyword arguments work across PyMySQL versions; newer releases reject positional ones.
        self.db = pymysql.connect(host=self.host, user=self.user, password=self.password,
                                  database=self.databases, charset='utf8', port=self.port)
        self.cursor = self.db.cursor()

    def process_item(self, item, spider):
        data = dict(item)
        keys = ','.join(data.keys())
        values = ','.join(['%s'] * len(data))
        # Table and column names come from the Item class; values go through placeholders.
        sql = 'insert into %s (%s) values (%s)' % (item.table, keys, values)
        self.cursor.execute(sql, tuple(data.values()))
        self.db.commit()
        return item

    def close_spider(self, spider):
        self.db.close()
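
To see what statement process_item ends up sending, the string-building step can be isolated into a small function that needs no database. build_insert is a hypothetical name used only for this sketch.

```python
def build_insert(table, data):
    """Return (sql, params) assembled the same way as in process_item."""
    keys = ','.join(data.keys())
    placeholders = ','.join(['%s'] * len(data))
    # Table and column names are interpolated directly, so they must come
    # from trusted code (the Item definition), never from scraped data.
    sql = 'insert into %s (%s) values (%s)' % (table, keys, placeholders)
    return sql, tuple(data.values())


if __name__ == '__main__':
    sql, params = build_insert('movie', {'id': 1035829945, 'score': 5})
    print(sql)     # the statement with one %s placeholder per column
    print(params)  # the values the driver binds to those placeholders
```

Only the values are passed separately to execute(), which lets the driver escape them; the column list is built from the item's own field names.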

The settings file

Set a small download delay, add request headers, configure the MySQL connection details, and enable the pipeline.

BOT_NAME = 'maoyan'
SPIDER_MODULES = ['maoyan.spiders']
NEWSPIDER_MODULE = 'maoyan.spiders'
DEFAULT_REQUEST_HEADERS = {
    'Referer': 'http://m.maoyan.com/movie/1216446/comments?_v_=yes',
    'User-Agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit/604.1.38 (KHTML, like Gecko) '
                  'Version/11.0 Mobile/15A372 Safari/604.1'
}
ITEM_PIPELINES = {
    'maoyan.pipelines.MaoyanPipeline': 300,
}
MYSQL_HOST = ''
MYSQL_DATABASES = 'movie'
MYSQL_PORT =   # fill in your MySQL port (an integer)
MYSQL_USER = 'root'
MYSQL_PASSWORD = ''

DOWNLOAD_DELAY = 0.1  # delay between download requests, in seconds
MOVIE_ID = '1216446'  # movie ID

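4. Crawl results

5. Scrapy-Redis

Because there are so many comments, crawling in a distributed fashion is faster.

Modifying the spider file

  • First import RedisSpider: from scrapy_redis.spiders import RedisSpider
  • Change the parent class from Spider to RedisSpider
  • Because the start URLs are now read from Redis, remove start_urls (here, the start_requests method) and add redis_key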

from scrapy_redis.spiders import RedisSpider

class Movie1Spider(RedisSpider):
    name = 'movie1'
    allowed_domains = ['m.maoyan.com']
    base_url = 'http://m.maoyan.com/mmdb/comments/movie/{}.json?_v_=yes&offset={}&startTime={}'
    redis_key = 'movie1:start_urls'  # Redis list the spider reads its start URLs from

    # def start_requests(self):
    #     time_now = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    #     url = self.base_url.format(MOVIE_ID, 0, quote(time_now))
    #     yield Request(url=url)

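Modifying the settings file

  • Redis connection parameters: REDIS_URL
  • Use the scrapy-redis scheduler: SCHEDULER
  • Use the scrapy-redis duplicate filter: DUPEFILTER_CLASS
  • Allow pausing and resuming by not clearing the Redis queue: SCHEDULER_PERSIST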

SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
REDIS_URL = 'redis://:password@IP:6379'  # replace password and IP with your own
SCHEDULER_PERSIST = True
MYSQL_HOST = 'address'  # replace with your MySQL host
MYSQL_DATABASES = 'movie'
MYSQL_PORT = 62782
MYSQL_USER = 'root'
MYSQL_PASSWORD = ''
DOWNLOAD_DELAY = 0.1  # delay between download requests, in seconds
MOVIE_ID = '1216446'  # movie ID

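After starting the spider, connect to Redis and push a start URL to check that everything works on a single machine. The key must match the spider's redis_key, movie1:start_urls.

127.0.0.1:6379> lpush movie1:start_urls http://m.maoyan.com/mmdb/comments/movie/1216446.json?_v_=yes&offset=0&startTime=2018-10-11%2018%3A14%3A17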
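
The seed URL can also be pushed from Python instead of redis-cli. This is only a sketch: it assumes the redis-py package is installed and a Redis server is reachable, and the function names make_start_url and push_start_url are illustrative.

```python
from datetime import datetime
from urllib.parse import quote

BASE_URL = ('http://m.maoyan.com/mmdb/comments/movie/{}.json'
            '?_v_=yes&offset={}&startTime={}')
REDIS_KEY = 'movie1:start_urls'  # must match redis_key on the spider


def make_start_url(movie_id, now=None):
    """Build the first request URL, using the current time by default."""
    now = now or datetime.now()
    return BASE_URL.format(movie_id, 0, quote(now.strftime('%Y-%m-%d %H:%M:%S')))


def push_start_url(redis_url, movie_id):
    """Push one seed URL onto the list that the RedisSpider polls."""
    import redis  # redis-py, imported lazily so make_start_url works without it
    client = redis.Redis.from_url(redis_url)
    client.lpush(REDIS_KEY, make_start_url(movie_id))


if __name__ == '__main__':
    # Only prints the URL; pushing it requires a running Redis server.
    print(make_start_url('1216446'))
```

Calling push_start_url('redis://127.0.0.1:6379', '1216446') would then place the seed URL on the key named by the spider's redis_key, and every idle worker listening on that key can pick it up.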

Distributed deployment

Use Gerapy to deploy the project to multiple machines in one batch.