ルモーリン

Perl Sample 22: Retrieving Information from a Website (Part 2)

Posted: 2020-03-08

I wrote this after seeing this tweet.

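This script retrieves the song lists for Nodojiman (のど自慢) from the NHK website and saves them to a CSV file. The CSV file is encoded in UTF-8. To work with it in Excel, do not open it directly with "Open File"; import it as text instead (Data | Get External Data | From Text).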
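As an aside, a common alternative to the text-import step is to put a UTF-8 byte order mark (BOM) at the start of the file, which Excel is widely reported to use to detect the encoding when the file is opened directly. The following is only a minimal sketch of that idea using core Perl. `write_csv_with_bom` and the file name `perlsample_022_bom.csv` are hypothetical and are not part of the script in this post, and the CSV lines are assumed to be formatted already.

```perl
#!/usr/bin/env perl
use v5.26;
use utf8;
use warnings;

# Hypothetical helper (not from the original script): write already-formatted
# CSV lines to $path as UTF-8, preceded by a byte order mark (U+FEFF).
sub write_csv_with_bom {
    my ($path, @lines) = @_;
    open my $fh, ">:encoding(UTF-8)", $path or die "Cannot open $path: $!";
    print {$fh} "\x{FEFF}";           # encoded as the bytes EF BB BF
    print {$fh} "$_\n" for @lines;    # one CSV record per line
    close $fh or die "Cannot close $path: $!";
    return;
}

# Assumed output name; the two lines are taken from the CSV shown in this post.
write_csv_with_bom(
    "perlsample_022_bom.csv",
    '"放送日","地域","会場","曲目","歌手"',
    '2016-04-03,"鹿児島県日置市","日置市伊集院文化会館","サウスポー","ピンク・レディー"',
);
```

The script below does not write a BOM, which is why the text-import step is needed for its output.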

#!/usr/bin/env perl

use v5.26;
use utf8;
use warnings;
use strict;

use feature "say";
use open IO => ":utf8";

use DateTime::Format::Strptime;
use Encode::Locale;
use HTML::TagParser;
use Text::CSV qw/ csv /;
use WWW::Mechanize;

binmode STDOUT, ":encoding(console_out)";
binmode STDERR, ":encoding(console_out)";

$| = 1;

my $mc = WWW::Mechanize->new;

use constant URL_BASE => "http://www6.nhk.or.jp";
use constant CSV_OUT => "perlsample_022.csv";
use constant HEADERS => [qw / 放送日 地域 会場 曲目 歌手 /];

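say "トップページを開く";    # "Opening the top page"
$mc->get(URL_BASE . "/nodojiman");
# The page should have loaded successfully
$mc->success or die "Failed to open the top page: " . $mc->res->status_line;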
say "URL:" . $mc->uri;
say "タイトル:" . $mc->title;

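say "これまでの放送をクリック";    # "Clicking 'Past broadcasts'"
$mc->follow_link(text => "これまでの放送");    # link text as it appears on the site
# The page should have loaded successfully
$mc->success or die "Failed to open the broadcast list: " . $mc->res->status_line;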
say "URL:" . $mc->uri;
say "タイトル:" . $mc->title;

# List of past broadcasts
my $tag = HTML::TagParser->new($mc->content);

# Get every tag marked class="listEach clearfix"
my @report = $tag->getElementsByClassName("listEach clearfix");
say "放送:@{[scalar @report]}回";    # "Broadcasts: N"

my $strp = DateTime::Format::Strptime->new(
	pattern => "%Y年%m月%d日",
	time_zone => "local",
);

my @program;
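for (@report) {
	# First child: the broadcast date, written as %Y年%m月%d日
	my $tag_dt = $_->firstChild;
	my $dt = $strp->parse_datetime($tag_dt->innerText);
	# Next sibling: its first text line is the place and the hall, separated by whitespace
	my $tag_dd = $tag_dt->nextSibling;
	my ($firstline) = split /\n/, $tag_dd->innerText;
	my ($place, $hall) = split /\s/, $firstline;
	# Link to the page listing that broadcast's songs
	my $href = $tag_dd->firstChild->firstChild->firstChild->getAttribute("href");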

	push @program, {
		dt => $dt,
		place => $place,
		hall => $hall,
		href => $href,
	};
}

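print "放送毎の曲目取得:";    # "Fetching the songs for each broadcast:"
for (@program) {
	print ".";    # one dot per broadcast as a progress indicator
	$mc->get(URL_BASE . $_->{href});
	$mc->success or die "Failed to open $_->{href}: " . $mc->res->status_line;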

	# Each table row holds the song title in its first cell and the singer in the next
	$tag = HTML::TagParser->new($mc->content);
	my @tag_tr = $tag->getElementsByTagName("tr");
	my @song_list;
	for (@tag_tr) {
		my $tag_td = $_->firstChild;
		my $title = $tag_td->innerText;
		$title =~ s/\r\n//g;    # strip line breaks inside the cell
		$tag_td = $tag_td->nextSibling;
		my $singer = $tag_td->innerText;
		$singer =~ s/\r\n//g;
		push @song_list, {
			title => $title,
			singer => $singer,
		};
	}

	$_->{song_list} = \@song_list;
}
print "\n";

# Flatten into one CSV row per song, ordered by broadcast date
my $csv;
for my $program (sort { DateTime::compare($a->{dt}, $b->{dt}) } @program) {
	push @{$csv}, {
		放送日 => $program->{dt}->strftime('%F'),
		地域 => $program->{place},
		会場 => $program->{hall},
		曲目 => $_->{title},
		歌手 => $_->{singer},
	} for @{$program->{song_list}};
}

say "CSV書き出し";    # "Writing the CSV"
$\ = "\012";
csv(
	encoding => "utf8",
	headers => HEADERS,
	eol => "\n",
	in => $csv,
	out => CSV_OUT,
);

Console output:

トップページを開く
URL:http://www6.nhk.or.jp/nodojiman/
タイトル:NHKのど自慢|NHK 総合テレビ・ラジオ第1
これまでの放送をクリック
URL:http://www6.nhk.or.jp/nodojiman/list/index.html
タイトル:これまでの放送|NHKのど自慢
放送:175回
放送毎の曲目取得:...............................................................................................................................................................................
CSV書き出し

The CSV file has many lines, so the middle part is omitted here.

"放送日","地域","会場","曲目","歌手"
2016-04-03,"鹿児島県日置市","日置市伊集院文化会館","サウスポー","ピンク・レディー"
2016-04-03,"鹿児島県日置市","日置市伊集院文化会館","それが大事","大事MANブラザーズバンド"
2016-04-03,"鹿児島県日置市","日置市伊集院文化会館","Let It Go~ありのままで~(Heartfull Ver.)","May J."
2016-04-03,"鹿児島県日置市","日置市伊集院文化会館","愛しのテキーロ","氷川きよし"
2016-04-03,"鹿児島県日置市","日置市伊集院文化会館","花は咲く","花は咲くプロジェクト"
2016-04-03,"鹿児島県日置市","日置市伊集院文化会館","ギンギラギンにさりげなく","近藤真彦"
2016-04-03,"鹿児島県日置市","日置市伊集院文化会館","365日の紙飛行機",AKB48
2016-04-03,"鹿児島県日置市","日置市伊集院文化会館","花","中孝介"
2016-04-03,"鹿児島県日置市","日置市伊集院文化会館","いつでも夢を","橋幸夫吉永小百合"
2016-04-03,"鹿児島県日置市","日置市伊集院文化会館","ふるさと","嵐"
2016-04-03,"鹿児島県日置市","日置市伊集院文化会館","生きてこそ","May J."
2016-04-03,"鹿児島県日置市","日置市伊集院文化会館","Rising Sun",EXILE
2016-04-03,"鹿児島県日置市","日置市伊集院文化会館","白雲の城","氷川きよし"
2016-04-03,"鹿児島県日置市","日置市伊集院文化会館","峠越え","福田こうへい"
2016-04-03,"鹿児島県日置市","日置市伊集院文化会館","ヒカレ","ゆず"
2016-04-03,"鹿児島県日置市","日置市伊集院文化会館","飛び方を忘れた小さな鳥",MISIA
2016-04-03,"鹿児島県日置市","日置市伊集院文化会館","人恋酒場","三山ひろし"
2016-04-03,"鹿児島県日置市","日置市伊集院文化会館","夜明け","天童よしみ"
2016-04-03,"鹿児島県日置市","日置市伊集院文化会館",Story,AI
2016-04-03,"鹿児島県日置市","日置市伊集院文化会館","涙そうそう","夏川りみ"
2016-04-10,"岩手県久慈市","久慈市文化会館","一番星","一番星プロジェクト"
2016-04-10,"岩手県久慈市","久慈市文化会館","桜木町","ゆず"
2016-04-10,"岩手県久慈市","久慈市文化会館","春一番","キャンディーズ"
(omitted)
2020-02-16,"岐阜県可児市","可児市文化創造センター","for you...","髙橋真梨子"
2020-02-16,"岐阜県可児市","可児市文化創造センター","もしもピアノが弾けたなら","西田敏行"
2020-02-16,"岐阜県可児市","可児市文化創造センター","手をつなごう ※チャンピオン※","絢香"
2020-02-23,"長野県辰野町","辰野町民会館","君は薔薇より美しい","布施明"
2020-02-23,"長野県辰野町","辰野町民会館","喝采","ちあきなおみ"
2020-02-23,"長野県辰野町","辰野町民会館",MARIONETTE,BOФWY
2020-02-23,"長野県辰野町","辰野町民会館","手紙","由紀さおり"
2020-02-23,"長野県辰野町","辰野町民会館","古い日記 ※特別賞※","和田アキ子"
2020-02-23,"長野県辰野町","辰野町民会館","これから","平原綾香"
2020-02-23,"長野県辰野町","辰野町民会館","心のこり","細川たかし"
2020-02-23,"長野県辰野町","辰野町民会館","恋のバカンス","ザ・ピーナッツ"
2020-02-23,"長野県辰野町","辰野町民会館",糸,"中島みゆき"
2020-02-23,"長野県辰野町","辰野町民会館","月がとっても青いから","菅原都々子"
2020-02-23,"長野県辰野町","辰野町民会館","限界突破×サバイバー","氷川きよし"
2020-02-23,"長野県辰野町","辰野町民会館","ハナミズキ","一青 窈"
2020-02-23,"長野県辰野町","辰野町民会館","高校三年生","舟木一夫"
2020-02-23,"長野県辰野町","辰野町民会館",Story,AI
2020-02-23,"長野県辰野町","辰野町民会館","だから僕は音楽を辞めた ※チャンピオン※","ヨルシカ"
2020-02-23,"長野県辰野町","辰野町民会館","勘太郎月夜唄","小畑実"
2020-02-23,"長野県辰野町","辰野町民会館","ひまわりの約束","秦基博"
2020-02-23,"長野県辰野町","辰野町民会館","悲しき口笛","美空ひばり"
2020-02-23,"長野県辰野町","辰野町民会館","酒と泪と男と女","河島英五"
2020-02-23,"長野県辰野町","辰野町民会館","いのちの歌","茉奈佳奈"
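
Once the CSV exists, it is easy to ask questions of it, for example which singers' songs are chosen most often. Here is a minimal sketch, assuming the rows have been loaded as hashes keyed by the header names (Text::CSV's `csv` function returns that shape when called with `headers => "auto"`). `top_singers` is a hypothetical helper, and the sample rows are a handful copied from the excerpt above, reduced to the two columns used.

```perl
#!/usr/bin/env perl
use v5.26;
use utf8;
use warnings;

# With the real file, the rows could be loaded like this (requires Text::CSV):
#   use Text::CSV qw/ csv /;
#   my $rows = csv(in => "perlsample_022.csv", headers => "auto", encoding => "utf8");

# Hypothetical helper: count how often each singer appears and return the
# $limit most frequent ones as [name, count] pairs.
sub top_singers {
    my ($rows, $limit) = @_;
    my %count;
    $count{ $_->{"歌手"} }++ for @$rows;
    # Most frequent first; ties are broken by name so the order is stable.
    my @ranked = sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count;
    splice @ranked, $limit if @ranked > $limit;
    return map { [ $_, $count{$_} ] } @ranked;
}

# Sample rows taken from the CSV excerpt in this post.
my @sample = (
    { "曲目" => "サウスポー",           "歌手" => "ピンク・レディー" },
    { "曲目" => "愛しのテキーロ",       "歌手" => "氷川きよし" },
    { "曲目" => "生きてこそ",           "歌手" => "May J." },
    { "曲目" => "白雲の城",             "歌手" => "氷川きよし" },
    { "曲目" => "限界突破×サバイバー", "歌手" => "氷川きよし" },
    { "曲目" => "Let It Go~ありのままで~(Heartfull Ver.)", "歌手" => "May J." },
);

binmode STDOUT, ":encoding(UTF-8)";
say "$_->[0]: $_->[1]" for top_singers(\@sample, 2);
# Prints:
#   氷川きよし: 3
#   May J.: 2
```

Sorting by count and then by name keeps the ranking deterministic when several singers are tied.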