
HTML fetched with mechanize seems to come out garbled

This is a continuation of my attempts to crawl kobo.

irb(main):001:0> require 'mechanize'
=> true
irb(main):002:0> agent = Mechanize.new
=> #<Mechanize:0x410eda0 @agent=#<Mechanize::HTTP::Agent:0x410ed88 @allowed_error_codes=[], @conditional_requests=true, @context=#<Mechanize:0x410eda0 ...>, @content_encoding_hooks=[], @cookie_jar=#<Mechanize::CookieJar:0x410ed40 @store=#<HTTP::CookieJar::HashStore:0x410dbd0 @mon_owner=nil, @mon_count=0, @mon_mutex=#<Mutex:0x410dba0>, @logger=nil, @gc_threshold=150, @jar={}, @gc_index=0>>, @follow_meta_refresh=false, @follow_meta_refresh_self=false, @gzip_enabled=true, @history=[], @ignore_bad_chunking=false, @keep_alive=true, @max_file_buffer=100000, @open_timeout=nil, @post_connect_hooks=[], @pre_connect_hooks=[], @read_timeout=nil, @redirect_ok=true, @redirection_limit=20, @request_headers={}, @robots=false, @user_agent="Mechanize/2.7.2 Ruby/1.9.2p290 (http://github.com/sparklemotion/mechanize/)", @webrobots=nil, @auth_store=#<Mechanize::HTTP::AuthStore:0x410da98 @auth_accounts={}, @default_auth=nil>, @authenticate_parser=#<Mechanize::HTTP::WWWAuthenticateParser:0x410da38 @scanner=nil>, @authenticate_methods={}, @digest_auth=#<Net::HTTP::DigestAuth:0x410d9d8 @mon_owner=nil, @mon_count=0, @mon_mutex=#<Mutex:0x410d9c0>, @nonce_count=-1>, @digest_challenges={}, @pass=nil, @scheme_handlers={"http"=>#<Proc:0x410d948@D:/RailsInstaller/Ruby1.9.2/lib/ruby/gems/1.9.1/gems/mechanize-2.7.2/lib/mechanize/http/agent.rb:172 (lambda)>, "https"=>#<Proc:0x410d948@D:/RailsInstaller/Ruby1.9.2/lib/ruby/gems/1.9.1/gems/mechanize-2.7.2/lib/mechanize/http/agent.rb:172 (lambda)>, "relative"=>#<Proc:0x410d948@D:/RailsInstaller/Ruby1.9.2/lib/ruby/gems/1.9.1/gems/mechanize-2.7.2/lib/mechanize/http/agent.rb:172 (lambda)>, "file"=>#<Proc:0x410d948@D:/RailsInstaller/Ruby1.9.2/lib/ruby/gems/1.9.1/gems/mechanize-2.7.2/lib/mechanize/http/agent.rb:172 (lambda)>}, @http=#<Net::HTTP::Persistent:0x410d828 @name="mechanize", @debug_output=nil, @proxy_uri=nil, @no_proxy=[], @headers={}, @override_headers={}, @http_versions={}, @keep_alive=300, @open_timeout=nil, @read_timeout=nil, @idle_timeout=5, 
@max_requests=nil, @socket_options=[[6, 1, 1]], @generation_key=:net_http_persistent_mechanize_generations, @ssl_generation_key=:net_http_persistent_mechanize_ssl_generations, @request_key=:net_http_persistent_mechanize_requests, @timeout_key=:net_http_persistent_mechanize_timeouts, @certificate=nil, @ca_file=nil, @private_key=nil, @ssl_version=nil, @verify_callback=nil, @verify_mode=1, @cert_store=nil, @generation=1, @ssl_generation=1, @reuse_ssl_sessions=true, @retry_change_requests=false, @ruby_1=true, @retried_on_ruby_2=false>>, @log=nil, @watch_for_set=nil, @history_added=nil, @pluggable_parser=#<Mechanize::PluggableParser:0x410d510 @parsers={"text/html"=>Mechanize::Page, "application/xhtml+xml"=>Mechanize::Page, "application/vnd.wap.xhtml+xml"=>Mechanize::Page, "image"=>Mechanize::Image, "text/xml"=>Mechanize::XmlFile, "application/xml"=>Mechanize::XmlFile}, @default=Mechanize::File>, @keep_alive_time=0, @proxy_addr=nil, @proxy_port=nil, @proxy_user=nil, @proxy_pass=nil, @html_parser=Nokogiri::HTML, @default_encoding=nil, @force_default_encoding=false>
irb(main):003:0> agent.user_agent = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.66 Safari/537.36'
=> "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.66 Safari/537.36"
irb(main):004:0> agent.get('http://rakuten.kobobooks.com/search/search.html?q=&t=all&f=keyword&s=publicationdatedesc&g=both&c=5wgoE9EyJkuWj2DMT8OIZg&l=ja&p=1')
=> #<Mechanize::Page
 {url
  #<URI::HTTP:0x3eba688 URL:http://rakuten.kobobooks.com/search/search.html?q=&t=all&f=keyword&s=publicationdatedesc&g=both&c=5wgoE9EyJkuWj2DMT8OIZg&l=ja&p=1>}
:
:

and then:

irb(main):013:0* puts agent.page.body
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" ><html xmlns="http://www.w3.org/1999/xhtml">
<head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
:
:
        <div class="browserNotification">
            縺薙・繧ヲ繧ァ繝悶し繧、繝医・縲∫樟蝨

Why, though...
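My best guess at what is going on: the body holds UTF-8 bytes, and the Windows console reads them as Windows-31J. Here is a minimal sketch that reproduces that kind of garbling offline. The sample text is made up for illustration, and the "・" replacement character is my own choice to mimic what the console printed, not something mechanize does.

```ruby
# Sample UTF-8 text ("このウェブ"), written with \u escapes so the source
# file's own encoding does not matter. The text is an assumption for this demo.
text = "\u3053\u306E\u30A6\u30A7\u30D6"

# The raw UTF-8 bytes, tagged as binary.
utf8_bytes = text.dup.force_encoding(Encoding::ASCII_8BIT)

# Decode those bytes as if they were Windows-31J, which is roughly what a
# Windows-31J console does when it receives them unconverted. Byte sequences
# that are invalid or unmapped in Windows-31J are replaced with "・" (U+30FB).
garbled = utf8_bytes.encode(
  Encoding::UTF_8, Encoding::Windows_31J,
  invalid: :replace, undef: :replace, replace: "\u30FB"
)

puts garbled
```

If this guess is right, the fix is to convert the body from UTF-8 to the console's encoding before printing it.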

irb(main):015:0> puts agent.page.body.encode(Encoding::Windows_31J)
Encoding::UndefinedConversionError: "\xE3" to UTF-8 in conversion from ASCII-8BIT to UTF-8 to Windows-31J
        from (irb):15:in `encode'
        from (irb):15
        from D:/RailsInstaller/Ruby1.9.2/bin/irb:12:in `<main>'
irb(main):016:0> puts agent.page.body.encode(Encoding::Windows_31J, Encoding::UTF_8)
Encoding::UndefinedConversionError: U+00BB from UTF-8 to Windows-31J
        from (irb):16:in `encode'
        from (irb):16
        from D:/RailsInstaller/Ruby1.9.2/bin/irb:12:in `<main>'

Neither of these works.

irb(main):017:0> puts agent.page.body.encode(Encoding::Windows_31J, Encoding::UTF_8, undef: :replace)
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" ><html xmlns="http://www.w3.org/1999/xhtml">
<head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
:
:
        <div class="browserNotification">
            このウェブサイトは、現在お客様がお使いのInternet Explorer 6 あるいは

(;^ω^) Finally.
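To recap why only the third call worked, here is a small offline sketch of the three attempts. It assumes the body is a UTF-8 byte string tagged ASCII-8BIT, which is what the first error message suggests. The sample string and the `try_encode` helper are hypothetical, made up for this demo.

```ruby
# A made-up stand-in for agent.page.body: UTF-8 bytes containing U+00BB,
# tagged as binary (ASCII-8BIT).
body = "\u3053\u306E\u30A6\u30A7\u30D6 \u00BB kobo".dup.force_encoding(Encoding::ASCII_8BIT)

# Hypothetical helper: run the block and return the conversion error, if any.
def try_encode
  yield
  nil
rescue Encoding::UndefinedConversionError => e
  e
end

# 1) No source encoding given: the string is treated as ASCII-8BIT, so the
#    first non-ASCII byte cannot be converted.
err_binary = try_encode { body.encode(Encoding::Windows_31J) }

# 2) Source encoding given as UTF-8: the bytes now decode correctly, but
#    U+00BB has no counterpart in Windows-31J.
err_unmapped = try_encode { body.encode(Encoding::Windows_31J, Encoding::UTF_8) }

# 3) Same as 2), but characters without a counterpart are replaced
#    instead of raising.
converted = body.encode(Encoding::Windows_31J, Encoding::UTF_8, undef: :replace)

puts err_binary.class
puts err_unmapped.class
puts converted.encoding
```

With `undef: :replace` and a non-Unicode target encoding, the unmapped character becomes "?" by default. The `replace:` option can supply a different substitute.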

Reference

Rubyのエンコーディング - @tmtms のメモ