
Corporate self-media teams in the finance world who hear what Wu Xiaobo has to say may well end up hiding in the restroom, sobbing

25 March 2019

1. Install Scrapy
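Scrapy is typically installed with pip; a minimal sketch (the exact command may differ depending on your Python environment):

pip install scrapy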

Wu Xiaobo once joked to his daughter: when your dad finally kicks the bucket, just carve one line on the epitaph: here lies a man whose collected works stood as tall as he did.

2. Create a Scrapy project with the scrapy startproject project_name command
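As an example, creating the itcast project used in the rest of this walkthrough would look like this (a sketch; itcast is simply the project name assumed below):

scrapy startproject itcast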

He once stacked the books he had written and edited on the floor, adding them one by one, and found the pile already past his knees and closing in on his thighs. He worked out that if he kept writing like this, the stack would sooner or later reach 1.8 meters in thirty or forty years. One of his hopes for himself is that, decades from now, a few of the books in that 1.8-meter stack can still be reprinted and still get mentioned now and then.

(Screenshot omitted.)

He is a prolific man. His latest book, Tencent: An Authorized Biography (《腾讯传》), has reportedly sold half of its 500,000-copy first printing and keeps selling at about 3,000 copies a day. He has also been called China's most outstanding financial author, one who has pushed serious business writing onto the bestseller lists.

Beyond writing, Wu Xiaobo's other role is that of investor. Together with partner Cao Guoxiong and others he set up the Shixiangjia New Media Fund (狮享家新媒体基金), and he has kept deploying it, completing investments in WeChat public accounts such as 餐饮老板内参, 酒业家, 12缸汽车, 灵魂有香气的女子, 张德芬 and 杜绍斐.

3. Following the prompts, use scrapy genspider spider_name domain_url to create a spider

In November 2014 he wrote in an article: "Almost overnight, a self-media matrix has taken shape, with capital as its hub and middle-class knowledge workers as its audience. So far the matrix covers more than 20 content companies and reaches roughly 25 million middle-class users."

(Screenshot omitted.)

Industry gossip has it that when Wu Xiaobo invests in public accounts, his valuations run high for the sector. Even so, winning money from the Shixiangjia New Media Fund is far from easy. Wu has said he only backs the number one or number two player in a vertical; even the number three is unlikely to get funded. In addition, the account's audience has to be white-collar or above, the business model has to be clear, and the account must already be profitable.

▲ Wu Xiaobo

Note: spider_name must not be the same as project_name; a common convention is to name the spider as a keyword plus _spider.

Wu Xiaobo is also an excellent self-media operator himself. In May 2014 he launched China's first finance talk show, Wu Xiaobo Channel (吴晓波频道); the show drew buzz and applause from plenty of heavyweight finance figures before it even aired. Launched alongside the video program were the WeChat public account 吴晓波频道 and a range of community activities.

4. Use the scrapy list command to view the names of all spiders, and scrapy --help to view the list of available commands
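Run from the project directory, a minimal sketch of both commands:

scrapy list

scrapy --help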

Today, 吴晓波频道 counts as a benchmark for community operations among Chinese self-media. He focuses on the O2O side of self-media, moving online users offline, and has launched activities such as a "coffee shop makeover plan" and "New Year goods crowdfunding".

Using the public account's social circles and its power to gather people, he runs offline lectures. At a thousand-person lecture in Shenzhen he said: "The internet has flattened every channel. In the pre-internet era I would have needed a huge team to run something like this. Now I push one message on WeChat and more than 400 people sign up, so my channel cost drops to zero." The fee for that "thousand-person class" came in three tiers, starting at 9,000 yuan per person, with package rates for groups.

In July 2016, 吴晓波频道 launched the paid audio product 每天听见吴晓波 (Hear Wu Xiaobo Every Day) at 180 yuan per year; it sold 100,000 subscriptions within a few months, making it one of the best-selling "paid knowledge" products in the content business.

5. Use scrapy shell url to debug selectors

Earlier this year, the company behind 吴晓波频道, Ba Jiu Ling Culture Communication Co. (巴九灵文化传播有限公司), closed a 160-million-yuan Series A at a post-money valuation of 2 billion yuan. Recently, Wu Xiaobo visited Huizhi Media (汇志传媒) and took questions from its operators. The Q&A follows:

First, run scrapy shell url to open the Scrapy shell debugging tool.

Q: This year the numbers at a great many self-media outfits look grim. Do you still intend to invest in self-media this year? What kind of value do you weigh most?

(Screenshot omitted.)

A: We have still invested in quite a few self-media businesses this year, including ones serving mothers and babies and ones serving cyclists. Some self-media fundraising has gone well too: 十点读书 just closed a round of 300-odd million yuan. So it is not all gloom; this year is (still) the year self-media cashes in.

Q: Fakery is rampant in self-media right now and zombie followers are everywhere. What effect do you think this kind of fraud has on the self-media ecosystem?

A: To an investor, a public account's follower count carries little value in itself. What investors care about more is whether you are in the top three of your niche, whether the revenue rests on a clear business model, and whether operating income sits on a steeply rising curve. Those things are genuinely hard to fake; fake orders, zombie followers and fake users cannot forge them.

Inside the Scrapy shell, enter the selector you want to test, as in the sketch below.
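A minimal sketch of such a shell session for this project (the URL and XPath expressions mirror the spider code shown later; they are illustrative and not guaranteed to match the live page):

scrapy shell "http://itcast.cn/"

>>> response.xpath("//li[@class='a_gd']/p[@class='img_box']/a/img/@src").extract()

>>> response.xpath("//li[@class='a_gd']/p/a/text()").extract()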

Q: What do you make of short video, the hottest thing at the moment?

(Screenshot omitted.)

A: Short video is very likely a bubble; its options for monetization are poor. The parties keenest on short video right now are certain platforms, which think in terms of open rates, retention and watch time. For the platforms, the revenue model of short video is clear.

For the people seriously making short video, there are really only three profit models: the first is advertising, the second is selling goods, the third is paid knowledge. None of the three looks promising at the moment. Ruhnn's Zhang Dayi model, for instance, does not really count as a short-video model; they reached a 3.3-billion valuation by selling face masks through a personality-driven influencer.

Q: How do you view platforms such as UC, Yidian Zixun and Baidu Baijia that are courting self-media traffic? Do you think they are currently in a good state of development?

6. Define the fields in items.py

A: I do not think so. In my view UC has no chance at all, and Baidu Baijia has little chance either; platforms of that kind basically live off traffic. Today the dividend is concentrated almost entirely on WeChat. Perhaps post-90s and ACG (二次元) users cluster a bit more on Weibo and QQ, where things catch fire more easily. The rest are basically hopeless.

The required fields are image_url and title:

Q: Home appliances belong to traditional manufacturing. Is there any prospect now for a company team to break into the self-media market? From what angle should it push?

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class ItcastItem(scrapy.Item):
    # define the fields for your item here like:
    image_url = scrapy.Field()
    title = scrapy.Field()
    pass

A: I do not think there is. Haier, which we regard as a relatively successful case, is a personality-driven one with strong integration ability. Beyond that, corporate self-media has simply not taken off, for two main reasons:

7. Modify the code in the generated spider file itcast_spider.py

1. Run by a single company, it is an advertising model; unless the company itself pays and treats the self-media operation as part of its PR, it has no value as public communication;

First, modify the configuration file settings.py:

2. As we all know, new media requires a sustainable capacity to produce content, and the truth is that a home-appliance company does not have that much to say to consumers.

Change ROBOTSTXT_OBEY = True to ROBOTSTXT_OBEY = False

That is why, in the age of WeChat public accounts, some once-huge Weibo accounts, such as Yao Chen and SOHO China founder Pan Shiyi, have lost their influence. The key is that the WeChat public account is a content model, whereas Weibo is a public-square model, a theatrical broadcasting model. So in the rational environment of public accounts, a company that can only talk to us in the same rational register has no real prospect.

Next, modify itcast_spider.py:

Take Dong Mingzhu as an example: even Gree's Dong Mingzhu only updates "Dong Mingzhu Self-Media" when she actually has something to say; she truly does not have that much to tell users.

# -*- coding: utf-8 -*-
import scrapy
from itcast.items import ItcastItem

class ItcastSpiderSpider(scrapy.Spider):
    name = 'itcast_spider'
    allowed_domains = ['itcast.cn']
    start_urls = ['http://itcast.cn/']

    def parse(self, response):
        node_list = response.xpath("//li[@class='a_gd']")
        for index, node in enumerate(node_list):
            item = ItcastItem()

            if(len(node.xpath("./p[@class='img_box']/a/img/@src"))):
                item['image_url'] = node.xpath("./p[@class='img_box']/a/img/@src").extract()[0]
            else:
                item['image_url'] = ""

            if(len(node.xpath("./p/a[@se_prerender_url='complete']/text()"))):
                item['title'] = node.xpath("./p/a[@se_prerender_url='complete']/text()").extract()[0]
            else:
                item['title'] = ""

            if(item['image_url'] or item['title']):
                yield item

So if you run an account for a company, do it only if the company pays. And if it really does pay, then turn the account into PR.

8. Use scrapy crawl spider_name to generate a JSON data file

In home appliances, turning self-media into yet another media outlet is pointless. Take 餐饮老板内参, one of the public accounts I invested in early: its significance lies in opening up the supply chain; it can go on to offer short-term loans to small and medium restaurant businesses, run training camps, and later even do branding for others. Catering is full of small and medium players, whereas home appliances is a business of giants, so a home-appliance self-media outlet built purely for communication has no value at all. China today no longer needs another outlet along the lines of a "China Home Appliance News".

First, do a trial run of the spider with scrapy runspider spider_file_name, for example as sketched below:
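For this project the trial run would look something like the following (a sketch; the path assumes the default project layout):

scrapy runspider itcast/spiders/itcast_spider.py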

Q: Why do some companies that badly want to build communities fail to get them off the ground? What is the difference between building a community around a product and building one around content?

(Screenshot omitted.)

A: We have run book clubs in 80-odd cities across the country, and after more than two years difficulties did surface. I have drawn a couple of conclusions. If a community is built on socializing, it easily turns into something frivolous. A truly valuable community should be an anti-social, mission-driven community with a sense of ritual; socializing is only the community's process, or its result. Much of the time we turn communities into mere socializing, which achieves nothing. Second, paid communities and public-interest communities should be kept strictly apart; you cannot braid the two together to sell goods, or users will easily feel played.

A group owner cannot be a commercial group owner and a public-interest group owner at the same time. As the initiator of a community you need an anti-social mindset. From a company's standpoint, building a mission-driven community is the steadier path; borrowing a community to hit commercial targets is a rather dangerous business.

Finally, we have not given anything away for a while, so here is a round of freebies.

The editors will pick the two best comments in the comment section below and send each one a copy of the book pictured above; the winners will be announced on Sunday.

The trial run shows that the item data is indeed being printed.

In addition, every selected comment will receive two tickets to the event shown below.

Then use scrapy crawl spider_name -o data.json to generate the JSON data file.

A 30,000-square-meter venue bringing together nearly 600 exhibitors from 30 countries, including global e-commerce platforms, internet business service providers, supply-chain brand distributors, innovation and startup communities, and smart-tech application vendors, plus about 40,000 professional visitors. IEBE presents the full international e-commerce ecosystem.

Inspect the generated data file data.json:

(Screenshot omitted.)

9. Use a pipeline to persist the data to MySQL

First, set up the database with the following details:

Host: 127.0.0.1

Username: root

Password: root

Database name: test

Table structure:

CREATE TABLE `itcast` (
`image_url` varchar(255) NOT NULL,
`title` varchar(255) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

 

Then enable the pipeline in the settings.py configuration file; a sketch of the relevant entry is shown below.
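The original screenshot is not available; this is a minimal sketch of the corresponding ITEM_PIPELINES entry, assuming the pipeline class shown below:

ITEM_PIPELINES = {
    'itcast.pipelines.ItcastPipeline': 300,
}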


Next, write the data-persistence logic in pipelines.py:

# -*- coding: utf-8 -*-

import pymysql

class ItcastPipeline(object):
    def __init__(self):
        self.conn = pymysql.connect(host='127.0.0.1', user='root', passwd='root', db='test', charset='utf8')
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        item = dict(item)
        # use a parameterized query so quotes in the data cannot break the SQL
        sql = "insert into itcast(image_url, title) values(%s, %s)"
        self.cursor.execute(sql, (item['image_url'], item['title']))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()
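This pipeline also depends on the pymysql package; if it is not installed yet, something like the following should take care of it:

pip install pymysql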

  

Application 1: use Scrapy to crawl Tencent's job postings and save them to a local JSON file

Steps:

1. Create a Scrapy project

scrapy startproject tencert

2. Create a spider class

cd tencert

scrapy genspider tencert_hr hr.tencent.com

3. Define the item fields

# -*- coding: utf-8 -*-

import scrapy

class TencertItem(scrapy.Item):
    position_name = scrapy.Field()
    position_link = scrapy.Field()
    position_category = scrapy.Field()
    position_person_count = scrapy.Field()
    position_city = scrapy.Field()
    created_at = scrapy.Field()

4. Write the spider

# -*- coding: utf-8 -*-
import scrapy
from tencert.items import TencertItem

class TencertHrSpider(scrapy.Spider):
    name = 'tencert_hr'
    allowed_domains = ['hr.tencent.com']
    start_urls = ['http://hr.tencent.com/position.php?keywords=%E8%AF%B7%E8%BE%93%E5%85%A5%E5%85%B3%E9%94%AE%E8%AF%8D&lid=0&tid=0']

    def parse(self, response):
        node_list = response.xpath("//tr[@class='even' or @class='odd']")
        for node in node_list:
            item = TencertItem()
            if len(node.xpath("./td[1]/a/text()")):
                item['position_name'] = node.xpath("./td[1]/a/text()").extract()[0]
            else:
                item['position_name'] = ""

            if len(node.xpath("./td[1]/a/@href")):
                item['position_link'] = node.xpath("./td[1]/a/@href").extract()[0]
            else:
                item['position_link'] = ""

            if len(node.xpath("./td[2]/text()")):
                item['position_category'] = node.xpath("./td[2]/text()").extract()[0]
            else:
                item['position_category'] = ""

            if len(node.xpath("./td[3]/text()")):
                item['position_person_count'] = node.xpath("./td[3]/text()").extract()[0]
            else:
                item['position_person_count'] = ""

            if len(node.xpath("./td[4]/text()")):
                item['position_city'] = node.xpath("./td[4]/text()").extract()[0]
            else:
                item['position_city'] = ""

            if len(node.xpath("./td[5]/text()")):
                item['created_at'] = node.xpath("./td[5]/text()").extract()[0]
            else:
                item['created_at'] = ""

            yield item

        if(len(response.xpath("//a[@id='next' and @class='noactive']")) == 0):
            url = "http://hr.tencent.com/" + response.xpath("//a[@id='next']/@href").extract()[0]
            yield scrapy.Request(url, callback=self.parse, dont_filter=True)

Notes:

a. Do not call encode("UTF-8") on field values inside the spider; that turns them into bytes and causes errors. The correct approach is to specify the encoding as UTF-8 when opening the file handle in pipelines.py.

b. Check whether a field's selector result is empty before taking index 0; if the result is empty, index 0 does not exist.

c. To generate a new request inside the spider, use yield scrapy.Request(url, callback=self.parse, dont_filter=True).

5. Write the pipeline

# -*- coding: utf-8 -*-
import json

class TencertPipeline(object):
    def __init__(self):
        self.f = open("data.json", "w", encoding="UTF-8")
    def process_item(self, item, spider):
        json_str = json.dumps(dict(item), ensure_ascii=False) + ", \n"
        self.f.write(json_str)
        return item
    def close_spider(self, spider):
        self.f.close()

Notes:

a. In the pipeline's constructor, opening the file with mode "w" behaves the same as mode "a" here, because the spider streams items to the pipeline through the yield generator and the pipeline class is constructed only once. (An alternative using Scrapy's open_spider hook is sketched after these notes.)

b. Specify the encoding as UTF-8 when opening the file.
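As an alternative to opening the file in the constructor, Scrapy's open_spider hook makes the file's lifetime explicit; a minimal sketch that writes the same data.json as above:

# -*- coding: utf-8 -*-
import json

class TencertPipeline(object):
    def open_spider(self, spider):
        # called once when the spider starts
        self.f = open("data.json", "w", encoding="UTF-8")

    def process_item(self, item, spider):
        # stream each item to the file as one JSON object per line
        self.f.write(json.dumps(dict(item), ensure_ascii=False) + ", \n")
        return item

    def close_spider(self, spider):
        # called once when the spider finishes
        self.f.close()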

 

6. Modify the configuration file to enable the pipeline

# -*- coding: utf-8 -*-

BOT_NAME = 'tencert'

SPIDER_MODULES = ['tencert.spiders']
NEWSPIDER_MODULE = 'tencert.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'tencert (+http://www.yourdomain.com)'

# Obey robots.txt rules
#ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'tencert.middlewares.TencertSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'tencert.middlewares.MyCustomDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'tencert.pipelines.TencertPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

Notes:

a. The pipeline must be enabled in the configuration file.

b. ROBOTSTXT_OBEY must be disabled in the configuration file, or set explicitly with ROBOTSTXT_OBEY = False.

 

7. Run the spider

scrapy crawl tencert_hr

 

Application 2: use Scrapy to crawl image data from a mobile AJAX (dynamically loaded) page, download the images to disk through the images pipeline, and rename them

1. Create a Scrapy project

scrapy startproject douyu

2. Generate the spider

cd douyu

scrapy genspider douyu_girl douyucdn.cn

3. Define the item fields

    # -*- coding: utf-8 -*-

    import scrapy

    class DouyuItem(scrapy.Item):
        nickname = scrapy.Field()
        imageurl = scrapy.Field()
        city = scrapy.Field()

  

4. Write the spider

    # -*- coding: utf-8 -*-
    import scrapy
    import json

    from douyu.items import DouyuItem

    class DouyuGirlSpider(scrapy.Spider):
        name = 'douyu_girl'
        allowed_domains = ['douyucdn.cn']

        base_url = 'http://capi.douyucdn.cn/api/v1/getVerticalroom?limit=20&offset='
        offset = 0
        start_urls = [base_url + str(offset)]

        def parse(self, response):
            girl_list = json.loads(response.body)['data']
            if len(girl_list) == 0:
                return
            for girl in girl_list:
                item = DouyuItem()
                item['imageurl'] = girl['vertical_src']
                item['nickname'] = girl['nickname']
                item['city'] = girl['anchor_city']

                yield item

            self.offset += 20
            url = self.base_url + str(self.offset)
            yield scrapy.Request(url, callback=self.parse)

  

5. Write the pipeline

    # -*- coding: utf-8 -*-
    import scrapy
    import os

    from douyu.settings import IMAGES_STORE
    from scrapy.pipelines.images import ImagesPipeline

    class DouyuPipeline(ImagesPipeline):
        def get_media_requests(self, item, info):
            # ask the images pipeline to download each image URL
            yield scrapy.Request(item['imageurl'])

        def item_completed(self, results, item, info):
            isok = results[0][0]
            if isok:
                # results[0][1]['path'] is relative to IMAGES_STORE, e.g. "full/<hash>.jpg"
                path = results[0][1]['path']
                nickname = item['nickname']
                city = item['city']
                old_path = os.path.join(IMAGES_STORE, path)
                new_path = os.path.join(IMAGES_STORE, nickname + "_" + city + ".jpg")
                os.rename(old_path, new_path)
            return item

Notes:

a. Image downloads use the built-in ImagesPipeline class: the images are requested in the get_media_requests(self, item, info) method and renamed in the item_completed(self, results, item, info) method.

b. Working with images requires the Pillow module for Python. Without it you get the error:

ModuleNotFoundError: No module named 'PIL'

Running pip install pil only reports that no such module exists; PIL has been replaced by Pillow, so pip install pillow solves the problem.
 

6. Modify the configuration file

    # -*- coding: utf-8 -*-

    BOT_NAME = 'douyu'

    SPIDER_MODULES = ['douyu.spiders']
    NEWSPIDER_MODULE = 'douyu.spiders'

    # Crawl responsibly by identifying yourself (and your website) on the user-agent
    #USER_AGENT = 'douyu (+http://www.yourdomain.com)'

    # Obey robots.txt rules
    ROBOTSTXT_OBEY = False

    IMAGES_STORE = "D:/test/scrapy/douyu"

    USER_AGENT = "Mozilla/5.0 (Linux; U; Android 2.3.7; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1"

    # Configure maximum concurrent requests performed by Scrapy (default: 16)
    #CONCURRENT_REQUESTS = 32

    # Configure a delay for requests for the same website (default: 0)
    # See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
    # See also autothrottle settings and docs
    #DOWNLOAD_DELAY = 3
    # The download delay setting will honor only one of:
    #CONCURRENT_REQUESTS_PER_DOMAIN = 16
    #CONCURRENT_REQUESTS_PER_IP = 16

    # Disable cookies (enabled by default)
    #COOKIES_ENABLED = False

    # Disable Telnet Console (enabled by default)
    #TELNETCONSOLE_ENABLED = False

    # Override the default request headers:
    #DEFAULT_REQUEST_HEADERS = {
    # ‘Accept’: ‘text/html,application/xhtml+xml,application/xml;q=0.9,/;q=0.8′,
    # ‘Accept-Language’: ‘en’,
    #}

    # Enable or disable spider middlewares
    # See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
    #SPIDER_MIDDLEWARES = {
    # ‘douyu.middlewares.DouyuSpiderMiddleware’: 543,
    #}

    # Enable or disable downloader middlewares
    # See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
    #DOWNLOADER_MIDDLEWARES = {
    # ‘douyu.middlewares.MyCustomDownloaderMiddleware’: 543,
    #}

    # Enable or disable extensions
    # See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
    #EXTENSIONS = {
    # ‘scrapy.extensions.telnet.TelnetConsole’: None,
    #}

    # Configure item pipelines
    # See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
    ITEM_PIPELINES = {
        'douyu.pipelines.DouyuPipeline': 300,
    }

    # Enable and configure the AutoThrottle extension (disabled by default)
    # See http://doc.scrapy.org/en/latest/topics/autothrottle.html
    #AUTOTHROTTLE_ENABLED = True
    # The initial download delay
    #AUTOTHROTTLE_START_DELAY = 5
    # The maximum download delay to be set in case of high latencies
    #AUTOTHROTTLE_MAX_DELAY = 60
    # The average number of requests Scrapy should be sending in parallel to
    # each remote server
    #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
    # Enable showing throttling stats for every response received:
    #AUTOTHROTTLE_DEBUG = False

    # Enable and configure HTTP caching (disabled by default)
    # See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
    #HTTPCACHE_ENABLED = True
    #HTTPCACHE_EXPIRATION_SECS = 0
    #HTTPCACHE_DIR = ‘httpcache’
    #HTTPCACHE_IGNORE_HTTP_CODES = []
    #HTTPCACHE_STORAGE = ‘scrapy.extensions.httpcache.FilesystemCacheStorage’

Notes:

a. The request URL here returns JSON; it is a mobile AJAX endpoint, so requests must imitate a mobile browser by sending a suitable user agent, i.e. configure the USER_AGENT constant directly in the configuration file.

b. With the ImagesPipeline enabled, the image storage directory must be specified in the configuration file by defining the IMAGES_STORE constant.
 

7. Run the spider

 scrapy crawl douyu_girl

 

Application 3: use Scrapy to crawl the financial report data of all listed companies from NetEase Finance (网易财经)

After analyzing the NetEase Finance site, the report URL turns out to be:

http://quotes.money.163.com/hs/marketdata/service/cwsd.php?host=/hs/marketdata/service/cwsd.php&page=10&query=date:2017-06-30&fields=NO,SYMBOL,SNAME,PUBLISHDATE,MFRATIO28,MFRATIO18,MFRATIO20,MFRATIO10,MFRATIO4,MFRATIO2,MFRATIO12,MFRATIO23,MFRATIO25,MFRATIO24,MFRATIO122&sort=MFRATIO28&order=desc&count=25&type=query&initData=[object%20Object]&callback=callback_1488472914&req=31556

where page is the pagination index (page=0 is the first page), as sketched below.
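A small sketch of how the paginated URL can be assembled in Python (the field list is abbreviated here for readability; the real spider below uses the full list):

    base = ('http://quotes.money.163.com/hs/marketdata/service/cwsd.php'
            '?host=/hs/marketdata/service/cwsd.php&page={page}'
            '&query=date:2017-06-30&fields=NO,SYMBOL,SNAME,PUBLISHDATE,MFRATIO28'
            '&sort=MFRATIO28&order=desc&count=25&type=query')

    # page=0 requests the first page, page=1 the second, and so on
    for page in range(3):
        print(base.format(page=page))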

Steps:

1. Create a Scrapy project

The project name is yimoney:

scrapy startproject yimoney

2. Create the spider

The spider name is yimongy_account:

cd yimoney

scrapy genspider yimongy_account "quotes.money.163.com"

 

3. Define the item data structure

    # -*- coding: utf-8 -*-

    import scrapy

    class YimoneyItem(scrapy.Item):
        # stock code
        symbol = scrapy.Field()
        # stock name
        sname = scrapy.Field()
        # report date
        publishdate = scrapy.Field()
        # basic earnings per share
        income_one = scrapy.Field()
        # net assets per share
        income_one_clean = scrapy.Field()
        # operating cash flow per share
        cash_one = scrapy.Field()
        # main business revenue (10,000 yuan)
        income_main = scrapy.Field()
        # main business profit (10,000 yuan)
        profit_main = scrapy.Field()
        # net profit (10,000 yuan)
        profit_clean = scrapy.Field()
        # total assets (10,000 yuan)
        property_all = scrapy.Field()
        # current assets (10,000 yuan)
        property_flow = scrapy.Field()
        # total liabilities (10,000 yuan)
        debt_all = scrapy.Field()
        # current liabilities (10,000 yuan)
        debt_flow = scrapy.Field()
        # net assets (10,000 yuan)
        property_clean = scrapy.Field()
    

  

4. Write the spider

    # -*- coding: utf-8 -*-
    import scrapy
    import json

    from yimoney.items import YimoneyItem

    class YimongyAccountSpider(scrapy.Spider):
        name = 'yimongy_account'
        allowed_domains = ['quotes.money.163.com']

        page = 0
        base_url = 'http://quotes.money.163.com/hs/marketdata/service/cwsd.php?host=/hs/marketdata/service/cwsd.php&page='
        url_tail = '&query=date:2017-06-30&fields=NO,SYMBOL,SNAME,PUBLISHDATE,MFRATIO28,MFRATIO18,MFRATIO20,MFRATIO10,MFRATIO4,MFRATIO2,MFRATIO12,MFRATIO23,MFRATIO25,MFRATIO24,MFRATIO122&sort=MFRATIO28&order=desc&count=25&type=query&initData=[object%20Object]&callback=callback_1488472914&req=31556'
        start_urls = [base_url + str(page) + url_tail]

        def parse(self, response):
            # strip the JSONP wrapper "callback_1488472914(...)" to obtain plain JSON
            data_dict = response.body[20:-1].decode('UTF-8')
            data_list = json.loads(data_dict)['list']
            if len(data_list) == 0:
                return
            for one in data_list:
                item = YimoneyItem()

                item['symbol'] = one['SYMBOL']
                item['sname'] = one['SNAME']
                item['publishdate'] = one['PUBLISHDATE']

                # the ratio fields are sometimes missing, so fall back to ''
                item['income_one'] = one.get('MFRATIO28', '')
                item['income_one_clean'] = one.get('MFRATIO18', '')
                item['cash_one'] = one.get('MFRATIO20', '')
                item['income_main'] = one.get('MFRATIO10', '')
                item['profit_main'] = one.get('MFRATIO4', '')
                item['profit_clean'] = one.get('MFRATIO2', '')
                item['property_all'] = one.get('MFRATIO12', '')
                item['property_flow'] = one.get('MFRATIO23', '')
                item['debt_all'] = one.get('MFRATIO25', '')
                item['debt_flow'] = one.get('MFRATIO24', '')
                item['property_clean'] = one.get('MFRATIO122', '')

                yield item

            # request the next page until the 'list' field comes back empty
            self.page += 1
            url = self.base_url + str(self.page) + self.url_tail
            yield scrapy.Request(url, callback=self.parse)

  

5. Write the pipeline

    # -*- coding: utf-8 -*-

    import json

    class YimoneyPipeline(object):
        def __init__(self):
            self.f = open("data.json", "w", encoding="UTF-8")

        def process_item(self, item, spider):
            json_str = json.dumps(dict(item), ensure_ascii=False)
            self.f.write(json_str + ",\r\n")
            return item

        def close_spider(self, spider):
            self.f.close()

  

6. Modify the configuration file

    # -*- coding: utf-8 -*-

    BOT_NAME = 'yimoney'

    SPIDER_MODULES = ['yimoney.spiders']
    NEWSPIDER_MODULE = 'yimoney.spiders'

    # Crawl responsibly by identifying yourself (and your website) on the user-agent
    #USER_AGENT = 'yimoney (+http://www.yourdomain.com)'

    # Obey robots.txt rules
    ROBOTSTXT_OBEY = False

    # Configure maximum concurrent requests performed by Scrapy (default: 16)
    #CONCURRENT_REQUESTS = 32

    # Configure a delay for requests for the same website (default: 0)
    # See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
    # See also autothrottle settings and docs
    #DOWNLOAD_DELAY = 3
    # The download delay setting will honor only one of:
    #CONCURRENT_REQUESTS_PER_DOMAIN = 16
    #CONCURRENT_REQUESTS_PER_IP = 16

    # Disable cookies (enabled by default)
    #COOKIES_ENABLED = False

    # Disable Telnet Console (enabled by default)
    #TELNETCONSOLE_ENABLED = False

    # Override the default request headers:
    #DEFAULT_REQUEST_HEADERS = {
    # ‘Accept’: ‘text/html,application/xhtml+xml,application/xml;q=0.9,/;q=0.8′,
    # ‘Accept-Language’: ‘en’,
    #}

    # Enable or disable spider middlewares
    # See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
    #SPIDER_MIDDLEWARES = {
    # ‘yimoney.middlewares.YimoneySpiderMiddleware’: 543,
    #}

    # Enable or disable downloader middlewares
    # See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
    #DOWNLOADER_MIDDLEWARES = {
    # ‘yimoney.middlewares.MyCustomDownloaderMiddleware’: 543,
    #}

    # Enable or disable extensions
    # See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
    #EXTENSIONS = {
    # ‘scrapy.extensions.telnet.TelnetConsole’: None,
    #}

    # Configure item pipelines
    # See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
    ITEM_PIPELINES = {
        'yimoney.pipelines.YimoneyPipeline': 300,
    }

    # Enable and configure the AutoThrottle extension (disabled by default)
    # See http://doc.scrapy.org/en/latest/topics/autothrottle.html
    #AUTOTHROTTLE_ENABLED = True
    # The initial download delay
    #AUTOTHROTTLE_START_DELAY = 5
    # The maximum download delay to be set in case of high latencies
    #AUTOTHROTTLE_MAX_DELAY = 60
    # The average number of requests Scrapy should be sending in parallel to
    # each remote server
    #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
    # Enable showing throttling stats for every response received:
    #AUTOTHROTTLE_DEBUG = False

    # Enable and configure HTTP caching (disabled by default)
    # See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
    #HTTPCACHE_ENABLED = True
    #HTTPCACHE_EXPIRATION_SECS = 0
    #HTTPCACHE_DIR = ‘httpcache’
    #HTTPCACHE_IGNORE_HTTP_CODES = []
    #HTTPCACHE_STORAGE = ‘scrapy.extensions.httpcache.FilesystemCacheStorage’

  

7. Run the spider

 scrapy crawl  yimongy_account

