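[Python] Scraping Horse Racing Data: Fetching Racehorse Information

February 25, 2025

Introduction

With horse racing data analysis as the theme, this article shows how to efficiently fetch racehorse performance records and pedigree information.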

race_id_loader.py

Source code

import pickle
import re

from ..Constants.local_paths import data_dir_path
from .UI_helper import print_message


def load_race_ids(ym_list, master):
    """Load the race IDs that fall in the given list of (year, month) pairs."""
    ym_dict = group_months_by_year(ym_list)
    race_id_dir = data_dir_path() / "RaceID"
    race_id_list = []
    missing_years = []

    # Read one race-ID file per year
    for year, months in ym_dict.items():
        race_id_path = race_id_dir / f"race_id_{year}.pkl"
        if not race_id_path.exists():
            missing_years.append(year)
            continue

        # Open the file and load its race IDs
        with open(race_id_path, "rb") as f:
            race_ids = pickle.load(f)

        # Collect the race IDs that belong to each requested month
        for ym in months:
            for item in race_ids:
                date = item[1]
                # Match year-month (yyyymm) followed by a two-digit day (dd)
                if re.fullmatch(ym + r"\d{2}", date):
                    race_id_list.append(item[0])

    # Warn about years whose race-ID file is missing
    if missing_years:
        print_message(
            master,
            f"No race IDs found for the following years: {', '.join(missing_years)}. Please fetch or update the race results.",
        )

    return race_id_list


def group_months_by_year(ym_list):
    """Group a list of (year, month) pairs by year, keyed by the yyyy string."""
    # Convert each pair to a zero-padded yyyymm string
    ym_strings = [f"{y}{str(m).zfill(2)}" for y, m in ym_list]
    ym_dict = {}

    # Group by year
    for ym in ym_strings:
        year = ym[:4]
        if year not in ym_dict:
            ym_dict[year] = []
        ym_dict[year].append(ym)

    return ym_dict

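Overview of the `load_race_ids` function

Calls group_months_by_year() to group the year-month list by year.
Calls data_dir_path() to locate the directory where race IDs are stored.
Opens each year's race-ID file and extracts the race IDs that fall in the requested months.
Calls print_message() to show a warning when a year has no race-ID file.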
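
The month-matching step can be tried in isolation. The following is a minimal sketch, assuming each entry in a yearly file is a (race_id, yyyymmdd) pair as the listing's item[0] and item[1] suggest; the helper name filter_by_months and the sample values are made up for illustration.

```python
import re

# Made-up sample of what one yearly file might hold: (race_id, yyyymmdd) pairs.
race_ids = [
    ("202405020811", "20240526"),
    ("202406030411", "20240601"),
    ("202409030811", "20240623"),
]


def filter_by_months(race_ids, months):
    """Return the IDs whose date is one of the yyyymm prefixes plus a two-digit day."""
    selected = []
    for ym in months:
        # fullmatch() requires the whole date string to match, not just a prefix.
        pattern = re.compile(re.escape(ym) + r"\d{2}")
        for race_id, date in race_ids:
            if pattern.fullmatch(date):
                selected.append(race_id)
    return selected


print(filter_by_months(race_ids, ["202406"]))
# ['202406030411', '202409030811']
```

Because the match is anchored at both ends, a malformed date with extra characters is rejected rather than silently accepted.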

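Overview of the `group_months_by_year` function

Groups a list of year-month pairs given as [yyyy, mm] by year.
Uses a dict to hold the list for each year.
Converts each pair to a yyyymm string and groups the strings by year.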
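
A short worked example makes the input and output shapes concrete. This is a compact restatement of the same grouping logic (using dict.setdefault instead of the explicit membership check), not the listing itself; the input values are arbitrary.

```python
def group_months_by_year(ym_list):
    """Group [year, month] pairs into {"yyyy": ["yyyymm", ...]}."""
    ym_dict = {}
    for year, month in ym_list:
        ym = f"{year}{str(month).zfill(2)}"  # zero-pad the month: 1 -> "01"
        ym_dict.setdefault(ym[:4], []).append(ym)
    return ym_dict


grouped = group_months_by_year([[2023, 12], [2024, 1], [2024, 2]])
print(grouped)
# {'2023': ['202312'], '2024': ['202401', '202402']}
```

The keys are strings, which is why the missing-year warning in load_race_ids can pass them straight to ', '.join().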

horse_html_fetcher.py

Source code

import pickle

from fake_useragent import UserAgent

from Libs.Constants.local_paths import data_dir_path
from Libs.Constants.URL_paths import UrlPaths

from .requests_helper import get_resp
from .UI_helper import print_message, update_progress


def horse_html_fetcher(race_id_list, master=None, overwrite=True):
    print_message(
        master,
        "Fetching racehorse page HTML ... ",
        line_brake=False,
    )
    horse_id_list = horse_id_loader(race_id_list, master)
    html_dir_path = data_dir_path() / "HTML" / "Horse"
    html_dir_path.mkdir(parents=True, exist_ok=True)
    if not horse_id_list:
        # Nothing to fetch; return the empty list so the return type stays consistent
        return horse_id_list
    user_agent = {"User-Agent": UserAgent().chrome}

    total = len(horse_id_list)
    for count, horse_id in enumerate(horse_id_list, start=1):
        # Build the save path
        html_path = html_dir_path / f"{horse_id}_html.pkl"

        # Skip horses whose HTML is already saved, unless overwrite is set
        if not overwrite and html_path.exists():
            continue

        url = f"{UrlPaths.HORSE_URL}/{horse_id}"
        # Fetch the racehorse page
        resp = get_resp(url, user_agent)
        if not resp:
            continue

        # Save the response with pickle
        with open(html_path, "wb") as f:
            pickle.dump(resp, f)

        # Update the progress bar
        update_progress(master, count, total)

    # Reset the progress bar once processing is finished
    update_progress(master, 0, 1)
    print_message(master, "Done")

    ped_html_fetcher(horse_id_list, master)

    return horse_id_list


def ped_html_fetcher(horse_id_list, master=None):
    print_message(
        master,
        "Fetching pedigree page HTML ... ",
        line_brake=False,
    )
    html_dir_path = data_dir_path() / "HTML" / "Ped"
    html_dir_path.mkdir(parents=True, exist_ok=True)
    user_agent = {"User-Agent": UserAgent().chrome}

    total = len(horse_id_list)
    print(total)  # Debug output: number of horses to process
    for count, horse_id in enumerate(horse_id_list, start=1):
        # Build the save path
        html_path = html_dir_path / f"{horse_id}_html.pkl"

        # Skip horses whose HTML is already saved
        if html_path.exists():
            continue

        url = f"{UrlPaths.PED_URL}/{horse_id}"
        # Fetch the pedigree page
        resp = get_resp(url, user_agent)
        if not resp:
            print(horse_id)  # Debug output: ID whose page could not be fetched
            continue

        # Save the response with pickle
        with open(html_path, "wb") as f:
            pickle.dump(resp, f)

        # Update the progress bar
        update_progress(master, count, total)

    # Reset the progress bar once processing is finished
    update_progress(master, 0, 1)
    print_message(master, "Done")


def horse_id_loader(race_id_list, master=None):
    """
    Extract horse IDs from the saved race-result data and return them.

    :param race_id_list: list of race IDs
    :param master: object used for GUI messages
    :return: horse_id_list (no duplicates)
    """
    # Directory that holds the raw race results
    raw_result_dir_path = data_dir_path() / "Raw" / "Result"
    horse_id_set = set()

    # Process one race ID at a time
    for race_id in race_id_list:
        year = race_id[:4]
        result_file_path = (
            raw_result_dir_path / f"Result_{year}" / f"{race_id}_result_df.pkl"
        )

        # Check that the file exists
        if not result_file_path.exists():
            print_message(master, f"File not found: {result_file_path}")
            continue  # Move on to the next race ID

        # Load the pkl file
        try:
            with open(result_file_path, "rb") as f:
                df = pickle.load(f)
        except Exception as e:
            print_message(master, f"Failed to read file: {result_file_path}, {e}")
            continue

        # Extract each horse_id and add it to the set
        for horse_id in df["horse_id"].tolist():
            if isinstance(horse_id, str) and horse_id.isdigit():
                horse_id_set.add(horse_id)

    # Return the result as a list
    horse_id_list = list(horse_id_set)

    return horse_id_list

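Overview of the `horse_html_fetcher` function

Calls horse_id_loader() to extract racehorse IDs from the race-result data.
Calls data_dir_path() to build the directory where the HTML data is saved.
Calls requests_helper.get_resp() to fetch each racehorse's HTML page.
Saves the fetched data with pickle.
Tracks progress with update_progress() and reports it to the user.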
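
The skip-or-fetch logic above can be exercised without touching the network. Below is a minimal sketch in which a hypothetical fetch callable stands in for get_resp() (whose implementation is not shown in this article); fetch_and_cache and fake_fetch are illustrative names, not part of the project.

```python
import pickle
import tempfile
from pathlib import Path


def fetch_and_cache(ids, out_dir, fetch, overwrite=True):
    """Fetch each ID and pickle the result, skipping cached files when overwrite is False.

    `fetch` is a hypothetical callable (id -> payload or None) standing in for get_resp().
    Returns the list of IDs that were actually written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for item_id in ids:
        path = out_dir / f"{item_id}_html.pkl"
        if not overwrite and path.exists():
            continue  # keep the cached copy
        payload = fetch(item_id)
        if not payload:
            continue  # fetch failed; a later run can retry this ID
        with open(path, "wb") as f:
            pickle.dump(payload, f)
        written.append(item_id)
    return written


def fake_fetch(item_id):
    # Stand-in for a network call; returns None for one ID to simulate a failure.
    return None if item_id == "bad" else f"<html>{item_id}</html>"


with tempfile.TemporaryDirectory() as tmp:
    first = fetch_and_cache(["a1", "bad", "b2"], tmp, fake_fetch, overwrite=False)
    second = fetch_and_cache(["a1", "bad", "b2", "c3"], tmp, fake_fetch, overwrite=False)

print(first)   # ['a1', 'b2']
print(second)  # ['c3']
```

On the second run only the new ID is written: the cached IDs are skipped, and the failing ID is retried but still produces no file.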

Overview of the `ped_html_fetcher` function

Uses the list of racehorse IDs collected by horse_html_fetcher().
Fetches the HTML of each horse's pedigree page (UrlPaths.PED_URL).
Saves the data in pickle format.
Tracks progress with update_progress().

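Overview of the `horse_id_loader` function

Extracts racehorse IDs from the race-result data (Result).
Loads each race result saved with pickle and reads its horse_id column.
Keeps only digit-only string IDs, removes duplicates, and returns them as a list.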
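
The filtering and de-duplication can be illustrated with plain lists in place of the DataFrame columns. This is a sketch under that assumption; the IDs are made up, and sorting the result is an addition here (the listing returns list(set), whose order is not guaranteed).

```python
def collect_horse_ids(columns):
    """Merge several horse_id columns into one sorted list of unique, digit-only strings."""
    unique_ids = set()
    for column in columns:
        for value in column:
            # Same check as the listing: drop None, numbers, and non-numeric text.
            if isinstance(value, str) and value.isdigit():
                unique_ids.add(value)
    return sorted(unique_ids)


# Made-up values standing in for df["horse_id"].tolist() from two races.
race_a = ["2019105219", "2020103456", None]
race_b = ["2019105219", 2021100001, "", "N/A"]

print(collect_horse_ids([race_a, race_b]))
# ['2019105219', '2020103456']
```

Note that the integer 2021100001 is dropped because it is not a string, so IDs need to be stored as text in the result data for this check to keep them.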

horse_parser.py


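Overview of the `horse_parser` function

Loads the saved HTML of each racehorse's race-record page.
Parses the HTML with BeautifulSoup.
Converts the table data to a DataFrame with pandas.read_html().
Saves the resulting data with pickle and tracks progress with update_progress().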
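
To show what turning an HTML table into row records involves, here is a dependency-free sketch. It deliberately swaps in the standard library's html.parser for BeautifulSoup and pandas.read_html(), so it is a stand-in for the idea rather than the approach described above; the class and function names and the sample HTML are made up, and the real page layout is not assumed.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the cell text of every <tr> in the document as a list of rows."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text pieces of the cell being read, or None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def parse_results_table(html):
    """Return one dict per data row, keyed by the cells of the first (header) row."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


# Made-up sample table.
sample_html = """
<table>
  <tr><th>date</th><th>race</th><th>finish</th></tr>
  <tr><td>2024/05/26</td><td>Sample Stakes</td><td>1</td></tr>
  <tr><td>2024/04/14</td><td>Example Cup</td><td>3</td></tr>
</table>
"""
records = parse_results_table(sample_html)
print(records[0])
# {'date': '2024/05/26', 'race': 'Sample Stakes', 'finish': '1'}
```

All values come back as strings; converting dates and finishing positions to proper types is left to a later step.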

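Overview of the `ped_parser` function

Uses the list of racehorse IDs collected by horse_parser().
Loads the saved HTML of each horse's pedigree page (PED_URL).
Parses the HTML with BeautifulSoup and converts the pedigree data to a DataFrame.
Saves the resulting data in pickle format and tracks progress.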
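
The save step shared by both parsers amounts to a pickle round trip. The sketch below assumes a per-horse file named <horse_id>_<kind>.pkl, which is a hypothetical naming scheme chosen for illustration; the record contents are made up as well.

```python
import pickle
import tempfile
from pathlib import Path


def save_parsed(records, out_dir, horse_id, kind):
    """Pickle parsed records to <out_dir>/<horse_id>_<kind>.pkl and return the path."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{horse_id}_{kind}.pkl"
    with open(path, "wb") as f:
        pickle.dump(records, f)
    return path


def load_parsed(path):
    """Load records written by save_parsed(). Only unpickle files you created yourself."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Made-up pedigree records for a placeholder horse ID.
pedigree = [{"generation": 1, "role": "sire", "name": "Example Sire"}]
with tempfile.TemporaryDirectory() as tmp:
    saved_path = save_parsed(pedigree, tmp, "0000000000", "ped")
    restored = load_parsed(saved_path)

print(restored == pedigree)  # True
```

pickle preserves Python objects exactly, which is convenient for DataFrames, but loading a pickle executes code embedded in the file, so it is only appropriate for data the program wrote itself.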


About the author

Katsuma
