Qt Digital Newspaper Reader: Text-and-Image Edition


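This is the text-and-image edition of the Qt digital newspaper reader. The earlier PDF edition only displayed each page as a PDF or a large JPG image. That is enough for skimming a paper, but it is awkward for reading individual articles closely and offers no way to search them. In addition, a newspaper that publishes neither PDFs nor large JPG images could not be archived at all. This edition instead archives the articles that digital newspapers publish as text and images. After comparing Solr and Elasticsearch, I decided to use Elasticsearch to store and search the articles. However, the Raspberry Pi 3B+ has only 1 GB of RAM, which is not enough to deploy Elasticsearch, so I switched to MySQL, and article search became a full-text search over the kicker (引题), headline (主题), subheadline (副题), image captions, and body text.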

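On the Raspberry Pi, the MySQL package installed via apt-get is actually MariaDB, and that build does not include the Mroonga storage engine, which MariaDB needs for full-text search (at least for Chinese text). I therefore removed the apt-get MariaDB and, staying with MariaDB rather than switching to MySQL, rebuilt it from source. Mroonga has been available by default since version 10.0.15, so no extra build options are needed; it only has to be enabled after the build. This article uses version 10.1.41.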

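After the build, point datadir at a new directory. The parent of the new directory must have permissions 755, and the new directory must be owned by mysql:mysql. Once datadir has been changed, MariaDB no longer comes up properly at boot and has to be started again by hand with sudo; a workaround is given later in this article.

One problem worth recording: the location of mysql.sock in my.cnf could not be changed. Because the new datadir is on a mounted hard disk, I wanted to put mysql.sock in the new datadir as well. I do not know whether mysql.sock simply cannot live in the same directory as datadir or whether the mounted disk was the cause, but MariaDB failed to restart. Under /media/pi, the original mount point (nas, for example) then contained only an empty mysql folder, while the original files appeared under nas1, which could not be accessed for lack of permissions. Deleting nas as root and running mv nas1 nas failed with a "device busy" message. The fix is simply to set mysql.sock back to its default location and reboot the Raspberry Pi.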

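Next, add swap space to prevent the database from running out of memory. On the Raspberry Pi, however, the swap file is located at /var/swap by default, and adding swap on the TF card shortens the card's life; that approach is described in the third article listed below. This article uses zram instead, following the zram section of the first article below (the overclocking guide). One difference: if you load the module manually and then run zram.sh, only zram0 is set up successfully, because the default number of devices is 1. You can pass the device count when loading the module (sudo modprobe zram num_devices=4), while zram.sh detects the number of CPU cores automatically. I therefore recommend that, after loading the module and downloading zram.sh, you add them straight to rc.local and reboot the Raspberry Pi. Another thing worth recording: when the Raspberry Pi was new, it reported 927 MB of memory; now it reports only 874 MB, the same size as the swap added through zram.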
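To make the relationship between num_devices and the CPU core count concrete, here is a small Python sketch. The zram.sh script itself is not reproduced in this article, so the layout below (one zram device per core, with total RAM split evenly between them) is an assumption about how such scripts commonly work, and the function names are hypothetical.

```python
def total_memory_kib(meminfo_text):
    """Extract MemTotal (in KiB) from the contents of /proc/meminfo."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            return int(line.split()[1])
    raise ValueError("MemTotal not found")


def plan_zram(meminfo_text, cores):
    """Return (device, size_in_bytes) pairs, one zram device per CPU core.

    Assumption: total RAM is divided evenly across the devices.
    """
    per_device = total_memory_kib(meminfo_text) // cores * 1024
    return [("/dev/zram%d" % i, per_device) for i in range(cores)]
```

With four cores the plan names /dev/zram0 through /dev/zram3, which is why the module has to be loaded with at least four devices before a script of this kind can configure them all.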

rc.local is shown in the figure below.

my_start.sh is shown in the figure below.

The Complete Guide to Overclocking the Raspberry Pi – 树莓派中文站

http://www.52pi.net/archives/1384

An Introduction to zram - 半月旋空 - CSDN blog

https://blog.csdn.net/longwang155069/article/details/51900031

The Right Way to Change the Raspberry Pi's Swap Partition | 树莓派实验室

http://shumeipai.nxez.com/2017/12/18/how-to-modify-raspberry-pi-swap-partition.html

A Brief Look at the Differences Between MySQL and MariaDB - 咻一咻的博客 - CSDN blog

https://blog.csdn.net/qq_37187976/article/details/79117863

Downloads - MariaDB

https://downloads.mariadb.org/

Changing the root Password in MariaDB - KeithTt - cnblogs

https://www.cnblogs.com/keithtt/p/6922378.html

cmake . -DCMAKE_BUILD_TYPE=Release

make

sudo make install

cd /usr/local/mysql/scripts

sudo ./mysql_install_db --user=mysql --basedir=/usr/local/mysql/  --datadir=/media/pi/nas/mysql

sudo cp /usr/local/mysql/support-files/mysql.server /etc/init.d/mariadb

sudo chmod +x /etc/init.d/mariadb

sudo systemctl enable mariadb

sudo vim /etc/init.d/mariadb    # set basedir, datadir and conf

sudo cp /usr/local/mysql/support-files/my-huge.cnf /etc/my.cnf

sudo vim /etc/my.cnf    # change datadir to /media/pi/nas/mysql; add default-character-set=utf8mb4 under [client] and character-set-server=utf8mb4 under [mysqld]

sudo mysql -u root

use mysql;

select host, user from user;    -- then delete every record except 'root'@'127.0.0.1'

update user set password=PASSWORD('YourPasswordHere') where user='root';

update user set host='%' where user='root';

flush privileges; 

sudo service mariadb restart
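After editing /etc/my.cnf, it can be worth confirming that both character-set options really ended up in the right sections. The sketch below uses Python's standard configparser for that; the function name `charset_settings` is hypothetical, and the parsing is deliberately simple (it skips `!include`-style directives and does not follow them).

```python
import configparser


def charset_settings(cnf_text):
    """Return the character-set options found under [client] and [mysqld]."""
    # Drop "!include"-style directives, which configparser cannot parse.
    lines = [ln for ln in cnf_text.splitlines()
             if not ln.lstrip().startswith("!")]
    parser = configparser.ConfigParser(
        allow_no_value=True,   # my.cnf allows bare options such as skip-external-locking
        strict=False,          # tolerate repeated sections or options
        interpolation=None,    # treat "%" literally
    )
    parser.read_string("\n".join(lines))
    return {
        "client": parser.get("client", "default-character-set", fallback=None),
        "mysqld": parser.get("mysqld", "character-set-server", fallback=None),
    }
```

Passing the contents of /etc/my.cnf to `charset_settings` should yield utf8mb4 for both keys if the edits described above were applied; a value of None means the option is missing from that section.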

Enable the Mroonga storage engine by following the article below.

Enabling the Mroonga Storage Engine for Full-Text Indexing in MariaDB 10.2.6 - 运维人生 - 51CTO blog

https://blog.51cto.com/jinyan2049/1942333

show engines;

INSTALL SONAME 'ha_mroonga';

CREATE FUNCTION last_insert_grn_id RETURNS INTEGER SONAME 'ha_mroonga.so';

show engines;
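Once Mroonga shows up in the engine list, the article table can be created on it. The table definition is not given in this article, so the Python sketch below is only an illustration: the column names are the ones the reader and crawler code refer to, while the `id` column, all column types, and the index names are assumptions. It defines one FULLTEXT index per column group that the reader searches, on the assumption that each MATCH() column list needs a matching index.

```python
# Column groups searched by the reader: everything, titles only, body only.
FULLTEXT_GROUPS = {
    "ft_all": ["pre_title", "title", "sub_title", "image_text", "content"],
    "ft_titles": ["pre_title", "title", "sub_title"],
    "ft_body": ["image_text", "content"],
}


def create_table_sql(table="t_epaper"):
    """Build a CREATE TABLE statement for a Mroonga-backed article table."""
    columns = [
        "id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY",  # assumed
        "paper_name VARCHAR(64) NOT NULL",
        "paper_date DATE NOT NULL",
        "paper_layout VARCHAR(128) NOT NULL",
        "seq_num INT NOT NULL",
        "pre_title TEXT",
        "title TEXT",
        "sub_title TEXT",
        "author VARCHAR(255)",
        "html_url VARCHAR(1024)",
        "image_text TEXT",
        "content MEDIUMTEXT",
        "update_time DATETIME",
    ]
    indexes = ["FULLTEXT INDEX %s (%s)" % (name, ", ".join(cols))
               for name, cols in FULLTEXT_GROUPS.items()]
    body = ",\n  ".join(columns + indexes)
    return ("CREATE TABLE %s (\n  %s\n) ENGINE=Mroonga DEFAULT CHARSET=utf8mb4;"
            % (table, body))
```

Printing `create_table_sql()` gives a statement that can be reviewed and adjusted before it is run in the mysql client.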

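For an introduction to full-text search in MariaDB/MySQL, its syntax, and search examples for various scenarios, see the following four articles.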

Full-Text Indexes - MariaDB Knowledge Base

https://mariadb.com/kb/en/library/full-text-indexes/

Mroonga - MariaDB Knowledge Base

https://mariadb.com/kb/en/library/mroonga/

MySQL Chinese Full-Text Search Demo SQL - 马丁传奇 - cnblogs

https://www.cnblogs.com/martinzhang/p/3220345.html

Notes on MySQL FULLTEXT Search and the ngram Parser Plugin for Chinese, Japanese, and Korean - 蛙鳜鸡鹳狸猿 - CSDN blog

https://blog.csdn.net/sweeper_freedoman/article/details/82847754
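As a compact illustration of the boolean-mode syntax these articles cover, the Python sketch below builds the three kinds of search expression the reader offers (exact phrase, all keywords, any keyword) and passes them to the database as bound parameters rather than pasting user input into the SQL text. The function names are hypothetical; the `%s` placeholders follow the parameter style that pymysql uses.

```python
def build_against(keywords, mode):
    """Build a boolean-mode search expression from space-separated keywords.

    mode is "phrase" (exact phrase), "and" (all words) or "or" (any word).
    """
    if mode == "phrase":
        return '"%s"' % keywords.strip()
    words = keywords.split()  # split() drops empty strings from repeated spaces
    if mode == "and":
        return " ".join("+" + w for w in words)
    if mode == "or":
        return " ".join(words)
    raise ValueError("unknown mode: %r" % mode)


def build_search_query(columns, keywords, mode, paper_date, paper_name):
    """Return (sql, params) suitable for cursor.execute(sql, params)."""
    # Column names come from a fixed list in the program, never from user input.
    match = ", ".join(columns)
    sql = (
        "SELECT paper_layout, pre_title, title, sub_title, author, "
        "MATCH(%s) AGAINST(%%s IN BOOLEAN MODE) AS relevance "
        "FROM t_epaper "
        "WHERE MATCH(%s) AGAINST(%%s IN BOOLEAN MODE) "
        "AND paper_date = %%s AND paper_name = %%s "
        "ORDER BY relevance DESC, seq_num ASC" % (match, match)
    )
    expr = build_against(keywords, mode)
    return sql, (expr, expr, paper_date, paper_name)
```

Binding the search expression as a parameter means a keyword containing a quote character cannot break out of the SQL string, which is a risk whenever the query is assembled by string formatting alone.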

mainwindow.cpp

#include "mainwindow.h"
#include "ui_mainwindow.h"
#include "readepaperwidget.h"
#include <QDir>
#include <QCollator>
#include <QLocale>
#include <QCheckBox>
#include <QHBoxLayout>
#include <QTimer>
#include <QMessageBox>
#include <QTreeWidgetItem>
#include <QTableWidgetItem>
#include <QApplication>
#include <QProgressDialog>

MainWindow::MainWindow(QWidget *parent) :
    QMainWindow(parent),
    ui(new Ui::MainWindow)
{
    ui->setupUi(this);
    QFont font;
    font.setPixelSize(16);
    setFont(font);
    setWindowTitle(QStringLiteral("数字报阅读器 - 图文版"));

    ui->dateEdit_read_start->setCalendarPopup(true);
    ui->dateEdit_read_end->setCalendarPopup(true);
    ui->dateEdit_read_start->setDate(QDate::currentDate());
    ui->dateEdit_read_end->setDate(QDate::currentDate());

    ui->dateEdit_search_start->setCalendarPopup(true);
    ui->dateEdit_search_end->setCalendarPopup(true);
    ui->dateEdit_search_start->setDate(QDate::currentDate());
    ui->dateEdit_search_end->setDate(QDate::currentDate());

    QStringList headers;
    headers << QStringLiteral("名称") << QStringLiteral("日期") << QStringLiteral("版面") << QStringLiteral("主题");
    ui->treeWidget_read_result->setColumnCount(headers.size());
    ui->treeWidget_read_result->setHeaderLabels(headers);
    connect(ui->treeWidget_read_result, SIGNAL(itemClicked(QTreeWidgetItem*,int)), this, SLOT(readEpaper(QTreeWidgetItem*,int)));

    for (int i = 0; i < headers.size(); ++i)
    {
        ui->treeWidget_read_result->headerItem()->setTextAlignment(i, Qt::AlignHCenter | Qt::AlignVCenter);
    }

    ui->lineEdit_keyword->setPlaceholderText(QStringLiteral("多个关键词用空格分开"));
    ui->comboBox_page_size->setCurrentText("50");

    ui->tableWidget_search_result->setEditTriggers(QAbstractItemView::NoEditTriggers);
    ui->tableWidget_search_result->setSelectionBehavior(QAbstractItemView::SelectRows);
    ui->tableWidget_search_result->setSelectionMode(QAbstractItemView::SingleSelection);
    ui->tableWidget_search_result->verticalHeader()->setVisible(false);
    ui->tableWidget_search_result->horizontalHeader()->setHighlightSections(false);
    ui->tableWidget_search_result->horizontalHeader()->setStretchLastSection(true);

    headers.clear();
    headers << QStringLiteral("名称") << QStringLiteral("日期") << QStringLiteral("版面") << QStringLiteral("引题") << QStringLiteral("主题") << QStringLiteral("副题") << QStringLiteral("作者");
    ui->tableWidget_search_result->setColumnCount(headers.size());
    ui->tableWidget_search_result->setHorizontalHeaderLabels(headers);

    for (int i = 0; i < headers.size(); ++i)
    {
        ui->tableWidget_search_result->horizontalHeader()->setSectionResizeMode(i, QHeaderView::ResizeToContents);
    }

    connect(ui->tableWidget_search_result, SIGNAL(doubleClicked(QModelIndex)), this, SLOT(readEpaperSearch(QModelIndex)));

    QTimer::singleShot(0, this, SLOT(initCheckBox()));
}

MainWindow::~MainWindow()
{
    delete ui;
}

void MainWindow::getEpaperName()
{
    if (!mEpaperNameLst.isEmpty())
    {
        return;
    }

    QDir dir("Z:\\");
    if (!dir.exists() || !dir.isReadable())
    {
        return;
    }

    dir.setFilter(QDir::Dirs | QDir::NoSymLinks | QDir::NoDotAndDotDot);
    dir.setSorting(QDir::Name);
    QFileInfoList fileInfoLst = dir.entryInfoList();
    if (fileInfoLst.isEmpty())
    {
        return;
    }

    foreach (QFileInfo fileInfo, fileInfoLst)
    {
        QString name = fileInfo.fileName();
        mEpaperNameLst.append(name);
    }

    QLocale cn(QLocale::Chinese);
    QCollator collator(cn);
    std::sort(mEpaperNameLst.begin(), mEpaperNameLst.end(), collator);
}

void MainWindow::initReadCheckBox()
{
    getEpaperName();

    foreach (QString epaperName, mEpaperNameLst)
    {
        QCheckBox *checkBox = new QCheckBox(QStringLiteral("%1").arg(epaperName));
        connect(checkBox, SIGNAL(toggled(bool)), this, SLOT(showCheckBoxRead(bool)));
        mCheckBoxLstRead.append(checkBox);
    }

    QVBoxLayout *mainLayout = new QVBoxLayout;
    QWidget *widget = new QWidget;

    mSelectAllCheckBoxRead = new QCheckBox(QStringLiteral("全选(未选择)"));
    QHBoxLayout *layout = new QHBoxLayout;
    layout->addWidget(mSelectAllCheckBoxRead);
    mainLayout->addLayout(layout);

    connect(mSelectAllCheckBoxRead, SIGNAL(toggled(bool)), this, SLOT(selectAllRead(bool)));

    int size = mEpaperNameLst.size();
    int column = 4;
    int row = (size + column - 1) / column; // round up, so no empty trailing row when size is a multiple of column
    for (int i = 1; i <= row; ++i)
    {
        QHBoxLayout *layout = new QHBoxLayout;
        int start = (i - 1) * column;
        int end = i * column - 1;
        for (int j = start; j <= end; ++j)
        {
            if (j < size)
            {
                layout->addWidget(mCheckBoxLstRead.at(j));
            }
        }
        mainLayout->addLayout(layout);
    }

    widget->setLayout(mainLayout);
    ui->scrollArea->setFrameShape(QFrame::NoFrame);
    ui->scrollArea->setWidget(widget);
}

void MainWindow::initSearchCheckBox()
{
    getEpaperName();

    foreach (QString epaperName, mEpaperNameLst)
    {
        QCheckBox *checkBox = new QCheckBox(QStringLiteral("%1").arg(epaperName));
        connect(checkBox, SIGNAL(toggled(bool)), this, SLOT(showCheckBoxSearch(bool)));
        mCheckBoxLstSearch.append(checkBox);
    }

    QVBoxLayout *mainLayout = new QVBoxLayout;
    QWidget *widget = new QWidget;

    mSelectAllCheckBoxSearch = new QCheckBox(QStringLiteral("全选(未选择)"));
    QHBoxLayout *layout = new QHBoxLayout;
    layout->addWidget(mSelectAllCheckBoxSearch);
    mainLayout->addLayout(layout);

    connect(mSelectAllCheckBoxSearch, SIGNAL(toggled(bool)), this, SLOT(selectAllSearch(bool)));

    int size = mEpaperNameLst.size();
    int column = 4;
    int row = (size + column - 1) / column; // round up, so no empty trailing row when size is a multiple of column
    for (int i = 1; i <= row; ++i)
    {
        QHBoxLayout *layout = new QHBoxLayout;
        int start = (i - 1) * column;
        int end = i * column - 1;
        for (int j = start; j <= end; ++j)
        {
            if (j < size)
            {
                layout->addWidget(mCheckBoxLstSearch.at(j));
            }
        }
        mainLayout->addLayout(layout);
    }

    widget->setLayout(mainLayout);
    ui->scrollArea_2->setFrameShape(QFrame::NoFrame);
    ui->scrollArea_2->setWidget(widget);
}

bool MainWindow::informationMessageBox(const QString& title, const QString& text, bool isOnlyOk)
{
    QMessageBox msgBox(this);
    msgBox.setFont(this->font());
    msgBox.setIcon(QMessageBox::Information);
    msgBox.setWindowTitle(title);
    msgBox.setText(text);
    if (isOnlyOk)
    {
        msgBox.setStandardButtons(QMessageBox::Ok);
        msgBox.setButtonText(QMessageBox::Ok, QStringLiteral("确定"));
    }
    else
    {
        msgBox.setStandardButtons(QMessageBox::Ok | QMessageBox::Cancel);
        msgBox.setButtonText(QMessageBox::Ok, QStringLiteral("确定"));
        msgBox.setButtonText(QMessageBox::Cancel, QStringLiteral("取消"));
    }

    return (msgBox.exec() == QMessageBox::Ok);
}

void MainWindow::showSearchData()
{
    // Remove all existing rows before showing the current page.
    ui->tableWidget_search_result->setRowCount(0);

    if (mSearchDataLst.isEmpty())
    {
        return;
    }

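    mTotalCount = mSearchDataLst.size();
    if (mPageSize <= 0)
    {
        return; // page size not set yet; avoid dividing by zero
    }
    // Round up: "count / size + 1" would add an empty page whenever
    // the count is an exact multiple of the page size.
    mTotalPage = (mTotalCount + mPageSize - 1) / mPageSize;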

    ui->label_page_tip->setText(QStringLiteral("共%1条结果,第%2页,共%3页")
                                .arg(mTotalCount).arg(mCurrentPage).arg(mTotalPage));

    QList<QStringList> tmpLst;
    if (mCurrentPage == mTotalPage)
    {
        for (int i = (mCurrentPage - 1) * mPageSize; i < mTotalCount; ++i)
        {
            tmpLst.append(mSearchDataLst.at(i));
        }
    }
    else
    {
        for (int i = (mCurrentPage - 1) * mPageSize; i <= (mCurrentPage * mPageSize - 1); ++i)
        {
            tmpLst.append(mSearchDataLst.at(i));
        }
    }

    foreach (QStringList searchDataLst, tmpLst)
    {
        int rowCount = ui->tableWidget_search_result->rowCount();
        ui->tableWidget_search_result->insertRow(rowCount);

        QTableWidgetItem *itemName = new QTableWidgetItem(QStringLiteral("%1")
                                                          .arg(searchDataLst[0]));
        QTableWidgetItem *itemDate = new QTableWidgetItem(QStringLiteral("%1")
                                                          .arg(searchDataLst[1]));
        QTableWidgetItem *itemLayout = new QTableWidgetItem(QStringLiteral("%1")
                                                            .arg(searchDataLst[2]));
        QTableWidgetItem *itemPreTitle = new QTableWidgetItem(QStringLiteral("%1")
                                                              .arg(searchDataLst[3]));
        QTableWidgetItem *itemTitle = new QTableWidgetItem(QStringLiteral("%1")
                                                           .arg(searchDataLst[4]));
        QTableWidgetItem *itemSubTitle = new QTableWidgetItem(QStringLiteral("%1")
                                                              .arg(searchDataLst[5]));
        QTableWidgetItem *itemAuthor = new QTableWidgetItem(QStringLiteral("%1")
                                                            .arg(searchDataLst[6]));

        ui->tableWidget_search_result->setItem(rowCount, 0, itemName);
        ui->tableWidget_search_result->setItem(rowCount, 1, itemDate);
        ui->tableWidget_search_result->setItem(rowCount, 2, itemLayout);
        ui->tableWidget_search_result->setItem(rowCount, 3, itemPreTitle);
        ui->tableWidget_search_result->setItem(rowCount, 4, itemTitle);
        ui->tableWidget_search_result->setItem(rowCount, 5, itemSubTitle);
        ui->tableWidget_search_result->setItem(rowCount, 6, itemAuthor);
    }
}

void MainWindow::initCheckBox()
{
    initReadCheckBox();
    initSearchCheckBox();
}

void MainWindow::selectAllRead(bool ok)
{
    if (ok)
    {
        foreach (QCheckBox *checkBox, mCheckBoxLstRead)
        {
            checkBox->setChecked(true);
        }
    }
    else
    {
        foreach (QCheckBox *checkBox, mCheckBoxLstRead)
        {
            checkBox->setChecked(false);
        }
    }
}

void MainWindow::selectAllSearch(bool ok)
{
    if (ok)
    {
        foreach (QCheckBox *checkBox, mCheckBoxLstSearch)
        {
            checkBox->setChecked(true);
        }
    }
    else
    {
        foreach (QCheckBox *checkBox, mCheckBoxLstSearch)
        {
            checkBox->setChecked(false);
        }
    }
}

void MainWindow::showCheckBoxRead(bool ok)
{
    Q_UNUSED(ok);

    int count = 0;
    foreach (QCheckBox *checkBox, mCheckBoxLstRead)
    {
        if (checkBox->isChecked())
        {
            count += 1;
        }
    }

    if (count == 0)
    {
        mSelectAllCheckBoxRead->setText(QStringLiteral("全选(未选择)"));
    }
    else
    {
        mSelectAllCheckBoxRead->setText(QStringLiteral("全选(已选择%1个)").arg(count));
    }
}

void MainWindow::showCheckBoxSearch(bool ok)
{
    Q_UNUSED(ok);

    int count = 0;
    foreach (QCheckBox *checkBox, mCheckBoxLstSearch)
    {
        if (checkBox->isChecked())
        {
            count += 1;
        }
    }

    if (count == 0)
    {
        mSelectAllCheckBoxSearch->setText(QStringLiteral("全选(未选择)"));
    }
    else
    {
        mSelectAllCheckBoxSearch->setText(QStringLiteral("全选(已选择%1个)").arg(count));
    }
}

void MainWindow::on_read_pushButton_clicked()
{
    if (ui->dateEdit_read_start->date() > ui->dateEdit_read_end->date())
    {
        return;
    }

    QStringList paperNameLst;
    foreach (QCheckBox *checkBox, mCheckBoxLstRead)
    {
        if (checkBox->isChecked())
        {
            paperNameLst.append(checkBox->text());
        }
    }

    if (paperNameLst.isEmpty())
    {
        return;
    }

    QString retStr = mDBHelper.getConnectDB();

    if (!retStr.isEmpty())
    {
        informationMessageBox(QStringLiteral("提示"), QStringLiteral("数据库连接失败:\n%1").arg(retStr));
        return;
    }

    ui->read_pushButton->setEnabled(false);
    ui->treeWidget_read_result->clear();

    QStringList paperDateLst;
    QDate startDate = ui->dateEdit_read_start->date();
    QDate endDate = ui->dateEdit_read_end->date();
    while (startDate <= endDate)
    {
        paperDateLst.append(startDate.toString("yyyy-MM-dd"));
        startDate = startDate.addDays(1);
    }

    QProgressDialog progress(this);
    progress.setFont(this->font());
    progress.setWindowTitle(QStringLiteral("数字报阅读器 - 图文版"));
    progress.setWindowFlags(windowFlags() & (~Qt::WindowContextHelpButtonHint) & (~Qt::WindowMinMaxButtonsHint) & (~Qt::WindowCloseButtonHint));
    progress.setLabelText(QStringLiteral("处理中..."));
    progress.setRange(0, paperDateLst.size() * paperNameLst.size());
    progress.setModal(true);
    progress.setCancelButtonText(QStringLiteral("取消"));
    progress.setMinimumDuration(0);
    connect(&progress, SIGNAL(canceled()), this, SLOT(progressCanceled()));
    int count = 1;

    QStringList columnLst;
    columnLst << "paper_layout" << "title";

    foreach (QString paperName, paperNameLst)
    {
        foreach (QString paperDate, paperDateLst)
        {
            qApp->processEvents(QEventLoop::ExcludeUserInputEvents);

            QString str = QStringLiteral("select columns from t_epaper where paper_date = '%1' and paper_name = '%2' order by seq_num;").arg(paperDate).arg(paperName);

            QTreeWidgetItem *name = new QTreeWidgetItem(ui->treeWidget_read_result);
            name->setText(0, paperName);
            ui->treeWidget_read_result->addTopLevelItem(name);

            QTreeWidgetItem *date = new QTreeWidgetItem(name);
            date->setText(1, paperDate);

            QStringList layoutLst;
            QList<QStringList> retLst = mDBHelper.getSqlSelect(str, columnLst);
            foreach (QStringList ret, retLst)
            {
                if (!layoutLst.contains(ret[0]))
                {
                    layoutLst.append(ret[0]);
                }
            }

            foreach (QString layout, layoutLst)
            {
                QTreeWidgetItem *paperLayout = new QTreeWidgetItem(date);
                paperLayout->setText(2, layout);

                foreach (QStringList ret, retLst)
                {
                    if (ret[0] == layout)
                    {
                        QTreeWidgetItem *paperTitle = new QTreeWidgetItem(paperLayout);
                        paperTitle->setText(3, ret[1]);
                    }
                }
            }

            progress.setValue(count++);
        }
    }

    ui->read_pushButton->setEnabled(true);
}

void MainWindow::on_search_pushButton_clicked()
{
    if (ui->lineEdit_keyword->text().isEmpty())
    {
        return;
    }

    if (ui->dateEdit_search_start->date() > ui->dateEdit_search_end->date())
    {
        return;
    }

    QStringList paperNameLst;
    foreach (QCheckBox *checkBox, mCheckBoxLstSearch)
    {
        if (checkBox->isChecked())
        {
            paperNameLst.append(checkBox->text());
        }
    }

    if (paperNameLst.isEmpty())
    {
        return;
    }

    QString retStr = mDBHelper.getConnectDB();

    if (!retStr.isEmpty())
    {
        informationMessageBox(QStringLiteral("提示"), QStringLiteral("数据库连接失败:\n%1").arg(retStr));
        return;
    }

    ui->search_pushButton->setEnabled(false);

    QStringList paperDateLst;
    QDate startDate = ui->dateEdit_search_start->date();
    QDate endDate = ui->dateEdit_search_end->date();
    while (startDate <= endDate)
    {
        paperDateLst.append(startDate.toString("yyyy-MM-dd"));
        startDate = startDate.addDays(1);
    }

    QString keyWord = ui->lineEdit_keyword->text();
    QString keyWordRelationship = ui->comboBox_keyword_relation->currentText();
    QString searchRange = ui->comboBox_search_range->currentText();

    QString against;
    if (keyWordRelationship == QStringLiteral("全词"))
    {
        against = QStringLiteral("'\"%1\"' IN BOOLEAN MODE").arg(keyWord);
    }
    else if (keyWordRelationship == QStringLiteral("并且"))
    {
        QStringList wordLst = keyWord.split(" ");
        against = "'";
        foreach (QString word, wordLst)
        {
            if (!word.isEmpty())
            {
                against += QStringLiteral("+%1 ").arg(word);
            }
        }
        against.remove(against.lastIndexOf(" "), 1);
        against += "' IN BOOLEAN MODE";
    }
    else if (keyWordRelationship == QStringLiteral("或者"))
    {
        QStringList wordLst = keyWord.split(" ");
        against = "'";
        foreach (QString word, wordLst)
        {
            if (!word.isEmpty())
            {
                against += QStringLiteral("%1 ").arg(word);
            }
        }
        against.remove(against.lastIndexOf(" "), 1);
        against += "' IN BOOLEAN MODE";
    }

    QString match;
    if (searchRange == QStringLiteral("全部"))
    {
        match = "pre_title, title, sub_title, image_text, content";
    }
    else if (searchRange == QStringLiteral("仅标题"))
    {
        match = "pre_title, title, sub_title";
    }
    else if (searchRange == QStringLiteral("仅内容"))
    {
        match = "image_text, content";
    }

    QProgressDialog progress(this);
    progress.setFont(this->font());
    progress.setWindowTitle(QStringLiteral("数字报阅读器 - 图文版"));
    progress.setWindowFlags(windowFlags() & (~Qt::WindowContextHelpButtonHint) & (~Qt::WindowMinMaxButtonsHint) & (~Qt::WindowCloseButtonHint));
    progress.setLabelText(QStringLiteral("处理中..."));
    progress.setRange(0, paperDateLst.size() * paperNameLst.size());
    progress.setModal(true);
    progress.setCancelButtonText(QStringLiteral("取消"));
    progress.setMinimumDuration(0);
    connect(&progress, SIGNAL(canceled()), this, SLOT(progressCanceled()));
    int count = 1;

    QStringList columnLst;
    columnLst << "paper_layout" << "pre_title" << "title" << "sub_title" << "author" << QStringLiteral("match(%1) against(%2) as relevance").arg(match).arg(against);

    mSearchDataLst.clear();
    foreach (QString paperName, paperNameLst)
    {
        foreach (QString paperDate, paperDateLst)
        {
            qApp->processEvents(QEventLoop::ExcludeUserInputEvents);

            QString str = QStringLiteral("select columns from t_epaper where match(%1) against(%2) and paper_date = '%3' and paper_name = '%4' order by relevance desc, paper_date desc, seq_num asc;").arg(match).arg(against).arg(paperDate).arg(paperName);

            QList<QStringList> retLst = mDBHelper.getSqlSelect(str, columnLst);
            foreach (QStringList ret, retLst)
            {
                ret.insert(0, paperName);
                ret.insert(1, paperDate);
                mSearchDataLst.append(ret);
            }

            progress.setValue(count++);
        }
    }

    ui->search_pushButton->setEnabled(true);
    on_pushButton_first_page_clicked();
}

void MainWindow::readEpaper(QTreeWidgetItem *item, int column)
{
    Q_UNUSED(column);

    if (!item->text(3).isEmpty())
    {
        QTreeWidgetItem *layout = item->parent();
        QTreeWidgetItem *date = layout->parent();
        QTreeWidgetItem *name = date->parent();

        QString paperTitle = item->text(3);
        QString paperLayout = layout->text(2);
        QString paperDate = date->text(1);
        QString paperName = name->text(0);

        ReadEpaperWidget *widget = new ReadEpaperWidget(paperName, paperDate, paperLayout, paperTitle);
        widget->showMaximized();
    }
}

void MainWindow::on_pushButton_first_page_clicked()
{
    mCurrentPage = 1;
    showSearchData();
}

void MainWindow::on_pushButton_previous_page_clicked()
{
    if (mCurrentPage == 1)
    {
        return;
    }

    mCurrentPage -= 1;
    showSearchData();
}

void MainWindow::on_pushButton_next_page_clicked()
{
    if (mCurrentPage == mTotalPage)
    {
        return;
    }

    mCurrentPage += 1;
    showSearchData();
}

void MainWindow::on_pushButton_last_page_clicked()
{
    mCurrentPage = mTotalPage;
    showSearchData();
}

void MainWindow::on_comboBox_page_size_currentTextChanged(const QString &arg1)
{
    mPageSize = arg1.toInt();
    on_pushButton_first_page_clicked();
}

void MainWindow::readEpaperSearch(QModelIndex index)
{
    Q_UNUSED(index);

    int row = ui->tableWidget_search_result->currentRow();

    QString paperName = ui->tableWidget_search_result->item(row, 0)->text();
    QString paperDate = ui->tableWidget_search_result->item(row, 1)->text();
    QString paperLayout = ui->tableWidget_search_result->item(row, 2)->text();
    QString paperTitle = ui->tableWidget_search_result->item(row, 4)->text();

    QStringList wordLst = ui->lineEdit_keyword->text().split(" ");
    QStringList highLightLst;
    foreach (QString word, wordLst)
    {
        if (!word.isEmpty())
        {
            highLightLst.append(word);
        }
    }

    if (!paperName.isEmpty() && !paperDate.isEmpty() && !paperLayout.isEmpty() &&!paperTitle.isEmpty())
    {
        ReadEpaperWidget *widget = new ReadEpaperWidget(paperName, paperDate, paperLayout, paperTitle, true, highLightLst);
        widget->showMaximized();
    }
}

void MainWindow::progressCanceled()
{
    // dummy
}
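Paging arithmetic of the kind used by showSearchData is easy to get off by one: computing the page count as total / size + 1 yields an empty trailing page whenever the total is an exact multiple of the page size (100 results at 50 per page would give 3 pages). The Python sketch below shows the rounding-up form together with the slice for a given page; the function names are hypothetical and the code is an illustration, not a port of the Qt class.

```python
def page_count(total, page_size):
    """Number of pages for `total` items; at least 1 so an empty result still has a page."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    return max(1, (total + page_size - 1) // page_size)


def page_slice(items, page, page_size):
    """Return the items on a 1-based page; the last page may be shorter."""
    start = (page - 1) * page_size
    return items[start:start + page_size]
```

Because slicing stops at the end of the list, the last page needs no special case, unlike the two separate loops in showSearchData.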

readepaperwidget.cpp

#include "readepaperwidget.h"
#include "dbhelper.h"
#include <QTextBrowser>
#include <QVBoxLayout>
#include <QTimer>
#include <QDate>
#include <QMessageBox>
#include <QClipboard>
#include <QApplication>
#include <QMenu>

ReadEpaperWidget::ReadEpaperWidget(const QString &paperName, const QString &paperDate, const QString &paperLayout, const QString &paperTitle, bool highLight, QStringList highLightLst)
{
    QFont font;
    font.setPixelSize(16);
    setFont(font);
    setWindowTitle(QStringLiteral("%1 %2").arg(paperName).arg(QDate::fromString(paperDate, "yyyy-MM-dd").toString(QStringLiteral("yyyy年M月d日"))));
    setAttribute(Qt::WA_DeleteOnClose);

    mTextBrowser = new QTextBrowser(this);
    QVBoxLayout *mainLayout = new QVBoxLayout(this);
    mainLayout->addWidget(mTextBrowser);
    setLayout(mainLayout);
    mTextBrowser->setContextMenuPolicy(Qt::CustomContextMenu);
    connect(mTextBrowser, SIGNAL(customContextMenuRequested(QPoint)), this, SLOT(menuDisplayed(QPoint)));

    mPaperName = paperName;
    mPaperDate = paperDate;
    mPaperLayout = paperLayout;
    mPaperTitle = paperTitle;

    mIsHighLight = highLight;
    mHighLightLst = highLightLst;

    QTimer::singleShot(0, this, SLOT(showArticle()));
}

ReadEpaperWidget::~ReadEpaperWidget()
{

}

bool ReadEpaperWidget::informationMessageBox(const QString &title, const QString &text, bool isOnlyOk)
{
    QMessageBox msgBox(this);
    msgBox.setFont(this->font());
    msgBox.setIcon(QMessageBox::Information);
    msgBox.setWindowTitle(title);
    msgBox.setText(text);
    if (isOnlyOk)
    {
        msgBox.setStandardButtons(QMessageBox::Ok);
        msgBox.setButtonText(QMessageBox::Ok, QStringLiteral("确定"));
    }
    else
    {
        msgBox.setStandardButtons(QMessageBox::Ok | QMessageBox::Cancel);
        msgBox.setButtonText(QMessageBox::Ok, QStringLiteral("确定"));
        msgBox.setButtonText(QMessageBox::Cancel, QStringLiteral("取消"));
    }

    return (msgBox.exec() == QMessageBox::Ok);
}

void ReadEpaperWidget::showArticle()
{
    DBHelper dbHelper;
    QString retStr = dbHelper.getConnectDB();

    if (!retStr.isEmpty())
    {
        informationMessageBox(QStringLiteral("提示"), QStringLiteral("数据库连接失败:\n%1").arg(retStr));
        return;
    }

    QStringList columnLst;
    columnLst << "pre_title" << "sub_title" << "author" << "html_url" << "image_text" << "content";
    QString str = QStringLiteral("select columns from t_epaper where paper_date = '%1' and paper_name = '%2' and paper_layout = '%3' and title = '%4' order by update_time desc limit 1;").arg(mPaperDate).arg(mPaperName).arg(mPaperLayout).arg(mPaperTitle);

    QList<QStringList> retLst = dbHelper.getSqlSelect(str, columnLst);
    foreach (QStringList ret, retLst)
    {
        QString preTitle = ret[0];
        QString subTitle = ret[1];
        QString author = ret[2];
        QString htmlUrl = ret[3];
        QString imageText = ret[4];
        QString content = ret[5];

        mHtmlUrl = htmlUrl;

        QString html;

        html.append(QStringLiteral("<p align='center'>%1 %2 %3</p>").arg(mPaperName).arg(QDate::fromString(mPaperDate, "yyyy-MM-dd").toString(QStringLiteral("yyyy年M月d日"))).arg(mPaperLayout));
        html.append(QStringLiteral("<p align='center'>网页链接:%1</p>").arg(htmlUrl));
        html.append(QStringLiteral("<p align='center'>引题:%1</p>").arg(preTitle));
        html.append(QStringLiteral("<p align='center'>主题:%1</p>").arg(mPaperTitle));
        html.append(QStringLiteral("<p align='center'>副题:%1</p>").arg(subTitle));
        html.append(QStringLiteral("<p align='center'>作者:%1</p>").arg(author));

        QStringList imageLst = imageText.split("; ");
        foreach (QString image, imageLst)
        {
            // Each entry has the form "<local image path>|<caption>".
            // Skip entries without a "|" instead of indexing past the end of the list.
            QStringList parts = image.split("|");
            if (parts.size() >= 2)
            {
                QString imageFile = parts[0]
                        .replace(QStringLiteral("/home/pi/数字报/"), "Z:/");
                QString caption = parts[1];

                html.append(QStringLiteral("<p align='center'><img src='%1' /><br></p>").arg(imageFile));
                html.append(QStringLiteral("<p align='center'>%1</p>").arg(caption.replace("<p>", "").replace("</p>", "")));
            }
        }

        html.append(QStringLiteral("<br>%1").arg(content));

        if (mIsHighLight)
        {
            foreach (QString highLight, mHighLightLst)
            {
                html.replace(highLight, QStringLiteral("<strong><font color='blue'>%1</font></strong>").arg(highLight));
            }
        }

        mTextBrowser->setHtml(html);

        QTextCursor textCursor(mTextBrowser->textCursor());
        textCursor.movePosition(QTextCursor::Start);
        mTextBrowser->setTextCursor(textCursor);
    }
}

void ReadEpaperWidget::menuDisplayed(const QPoint &pos)
{
    Q_UNUSED(pos);

    QMenu *menu = new QMenu(this);

    QAction *copy = new QAction(QStringLiteral("复制"), this);
    connect(copy, SIGNAL(triggered(bool)), mTextBrowser, SLOT(copy()));

    QAction *selectAll = new QAction(QStringLiteral("选择全部"), this);
    connect(selectAll, SIGNAL(triggered(bool)), mTextBrowser, SLOT(selectAll()));

    QAction *copyPlainText = new QAction(QStringLiteral("复制纯文本"), this);
    connect(copyPlainText, SIGNAL(triggered(bool)), this, SLOT(copyPlainText()));

    QAction *copyHtmlUrl = new QAction(QStringLiteral("复制网页链接"), this);
    connect(copyHtmlUrl, SIGNAL(triggered(bool)), this, SLOT(copyHtmlUrl()));

    menu->addAction(copy);
    menu->addAction(selectAll);
    menu->addSeparator();
    menu->addAction(copyPlainText);
    menu->addAction(copyHtmlUrl);

    menu->exec(QCursor::pos());

    delete copy;
    delete selectAll;
    delete copyPlainText;
    delete copyHtmlUrl;
    delete menu;
}

void ReadEpaperWidget::copyHtmlUrl()
{
    QClipboard *clipboard = QApplication::clipboard();
    clipboard->setText(mHtmlUrl);
}

void ReadEpaperWidget::copyPlainText()
{
    QClipboard *clipboard = QApplication::clipboard();
    clipboard->setText(mTextBrowser->toPlainText());
}
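One caveat about keyword highlighting by plain string replacement over the whole HTML, as showArticle does: a keyword such as "font" or "p" also matches inside tags, attribute values, and image paths, and each further keyword can match the markup inserted for the previous one. The Python sketch below shows one way around this: split the document into tags and text, then highlight only the text pieces in a single pass. The function name is hypothetical, and the tag-splitting regular expression is deliberately naive (it assumes no ">" inside attribute values).

```python
import re

TAG_RE = re.compile(r"(<[^>]*>)")
HIGHLIGHT = "<strong><font color='blue'>%s</font></strong>"


def highlight(html, keywords):
    """Wrap keyword occurrences found in text nodes, leaving markup untouched."""
    words = [re.escape(w) for w in keywords if w]
    if not words:
        return html
    # Longest keywords first so a shorter keyword cannot split a longer one.
    pattern = re.compile("|".join(sorted(words, key=len, reverse=True)))
    # With one capturing group, split() alternates: text, tag, text, tag, ..., text.
    parts = TAG_RE.split(html)
    for i in range(0, len(parts), 2):
        parts[i] = pattern.sub(lambda m: HIGHLIGHT % m.group(0), parts[i])
    return "".join(parts)
```

Matching all keywords with one alternation pattern also means the inserted highlight markup is never scanned again for later keywords.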

epaper_v2.py

#!/usr/bin/python3 
# coding: utf-8

from urllib import request
from urllib import error
from urllib.parse import quote
import re
import threading
import os
import sys
import string
import time
import datetime
import random
import schedule
import fcntl
import queue
import pymysql
import hashlib
from DBUtils.PooledDB import PooledDB

import socket
socket.setdefaulttimeout(20.0)


pdf_dir = "/home/pi/数字报/"
epapers_done_dict = {}

download_image_queue = queue.Queue()
mysql_insert_queue = queue.Queue()
download_image_error_queue = queue.Queue()
mysql_insert_error_queue = queue.Queue()

download_image_error_dict = {}
mysql_insert_error_dict = {}
download_image_error_dict_lock = threading.Lock()
mysql_insert_error_dict_lock = threading.Lock()
mysql_insert_log_file_dir_lock = threading.Lock()

download_image_thread_lst = []
mysql_insert_thread_lst = []
download_image_error_thread_lst = []
mysql_insert_error_thread_lst = []

mysql_pool = PooledDB(pymysql, 10, host="host", user="user", password="password", database="database", port=3306)


def my_escape_string(src_str):
    return src_str.replace("\\", "\\\\").replace("'", "\\'").replace("\"", "\\\"")\
            .replace("\n", "\\n").replace("\r", "\\r").replace("\0", "\\0")\
            .replace("&nbsp; ", " ").replace("&nbsp;", " ")\
            .replace("\u3000", " ").replace("\u0020", " ").replace("\xa0", " ")\
            .replace("\u00a0", " ").replace("\uff0c", ",").replace("\uff1b", ";")  # assumed: full-width comma and semicolon mapped to ASCII


def my_print(paper_name, paper_date, paper_info):
    log_file_dir = "/home/pi/python_svn/%s%s-logs/%s/" %(paper_date.split("-")[0], paper_date.split("-")[1], paper_date)
    if not os.path.isdir(log_file_dir):
        os.makedirs(log_file_dir)

    log_file_name = "%s%s.txt" %(log_file_dir, paper_name)
    with open(log_file_name, "a", encoding="utf-8") as f:
        fcntl.flock(f, fcntl.LOCK_EX)
        date_time = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())
        print("[%s] [%s %s] %s\n" %(date_time, paper_name, paper_date, paper_info))
        f.write("[%s] [%s %s] %s\n\n" %(date_time, paper_name, paper_date, paper_info))
        fcntl.flock(f, fcntl.LOCK_UN)


def download_image(image_file_name, image_url):
    headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36'}
    req = request.Request(url=quote(image_url, safe=string.printable), headers=headers)
    image = "%s %s" %(image_file_name, image_url)
    try:
        total = 0
        with request.urlopen(req) as f:
            with open("%s" %(image_file_name), "wb") as f2:
                while True:
                    buff = f.read(1024 * 100)
                    if not buff:
                        break 

                    f2.write(buff)
                    total += len(buff)  # count the bytes actually read, not the requested chunk size
                    sys.stdout.write("download %d KB\r" %(total / 1024))
                    sys.stdout.flush()
                    
        print("download %s OK [%d KB]\n" %(image, total / 1024))
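    except error.HTTPError as e:
        if e.code == 404:
            print("download %s ERROR [404]\n" %(image))
        else:
            # Other HTTP errors are not retried here; log them instead of dropping them silently.
            print("download %s ERROR [HTTP %d]\n" %(image, e.code))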
    except Exception:
        md5_image = hashlib.md5(image.encode(encoding="UTF-8")).hexdigest()
        download_image_error_dict_lock.acquire()
        if md5_image not in download_image_error_dict.keys():
            download_image_error_dict[md5_image] = 1
            download_image_error_queue.put(image)
        elif download_image_error_dict[md5_image] <= 2:
            download_image_error_dict[md5_image] += 1
            download_image_error_queue.put(image)
        download_image_error_dict_lock.release()
        print("download %s ERROR\n" %(image))


def download_image_thread():
    while True:
        image_lst = download_image_queue.get()
        image_file_name = image_lst[0]
        image_url = image_lst[1]
        image = "%s %s" %(image_file_name, image_url)

        print("%s: %s\n" %(threading.current_thread().name, image))

        download_image(image_file_name, image_url)
        
        download_image_queue.task_done()


def mysql_insert_thread():
    while True:
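        sql_data = mysql_insert_queue.get()
        # sql_data is a string of quoted fields joined by ", "; the split below is
        # only correct if no field value itself contains that separator.
        info_lst = sql_data.split(", ")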
        paper_name = str(info_lst[0])[1:-1]
        paper_date = str(info_lst[1])[1:-1]
        paper_layout = str(info_lst[2])[1:-1]
        pre_title = str(info_lst[3])[1:-1]
        title = str(info_lst[4])[1:-1]
        sub_title = str(info_lst[5])[1:-1]
        author = str(info_lst[6])[1:-1]
        image_text = str(info_lst[7])[1:-1]
        content = str(info_lst[8])[1:-1]
        html_url = str(info_lst[9])[1:-1]
        seq_num = int(info_lst[10])

        info = "%s, %s, %s, %s, %s, %s" %(paper_name, paper_date, paper_layout, pre_title, title, sub_title)
        md5_str = hashlib.md5(info.encode(encoding="UTF-8")).hexdigest()
        md5_url = hashlib.md5(html_url.encode(encoding="UTF-8")).hexdigest()
        date_time = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())
        
        sql = "insert ignore into t_epaper values('%s', '%s', '%s', '%s', '%s', '%s', '%s', '%s', '%s', '%s', '%s', '%s', %d, '%s');" %(md5_str, md5_url, paper_name, paper_date, paper_layout, pre_title, title, sub_title, author, image_text, content, html_url, seq_num, date_time)

        print("%s: %s\n" %(threading.current_thread().name, info))

        date = time.strftime("%Y-%m-%d", time.localtime())
        log_file_dir = "/home/pi/python_svn/mysql_insert-logs/%s/" %(date)
        mysql_insert_log_file_dir_lock.acquire()
        if not os.path.isdir(log_file_dir):
            os.makedirs(log_file_dir)
        mysql_insert_log_file_dir_lock.release()

        log_file_name = "%s%s.txt" %(log_file_dir, threading.current_thread().name)

        with open(log_file_name, "a", encoding="utf-8") as f:
            fcntl.flock(f, fcntl.LOCK_EX)
            date_time = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())
            f.write("[%s] %s\n\n" %(date_time, sql))
            fcntl.flock(f, fcntl.LOCK_UN)

        mysql_conn = None
        try:
            mysql_conn = mysql_pool.connection()
            mysql_conn.ping(reconnect=True)
            with mysql_conn.cursor() as cursor:
                cursor.execute(sql)
            mysql_conn.commit()
        except Exception as e:
            # mysql_conn is still None if the pool could not hand out a connection.
            if mysql_conn is not None:
                mysql_conn.rollback()
            md5_sql = hashlib.md5(sql.encode(encoding="UTF-8")).hexdigest()
            # Re-queue the failed statement, at most three times per distinct SQL string.
            with mysql_insert_error_dict_lock:
                if md5_sql not in mysql_insert_error_dict.keys():
                    mysql_insert_error_dict[md5_sql] = 1
                    mysql_insert_error_queue.put("%s#%s" %(sql, e))
                elif mysql_insert_error_dict[md5_sql] <= 2:
                    mysql_insert_error_dict[md5_sql] += 1
                    mysql_insert_error_queue.put("%s#%s" %(sql, e))
            print("%s: %s %s\n" %(threading.current_thread().name, info, e))
        else:
            print("%s: %s OK\n" %(threading.current_thread().name, info))
        finally:
            if mysql_conn is not None:
                mysql_conn.close()

        mysql_insert_queue.task_done()


def download_image_error_thread():
    while True:
        image = download_image_error_queue.get()
        image_lst = image.split(" ")
        image_file_name = image_lst[0]
        image_url = image_lst[1]

        print("%s: %s\n" %(threading.current_thread().name, image))

        if os.path.isfile(image_file_name) and os.path.getsize(image_file_name):
            os.remove(image_file_name)
        
        download_image(image_file_name, image_url)

        download_image_error_queue.task_done()


def mysql_insert_error_thread():
    while True:
        sql_data = mysql_insert_error_queue.get()
        sql = sql_data.split(");#")[0] + ");"
        err = sql_data.split(");#")[1]

        date = time.strftime("%Y-%m-%d", time.localtime())
        log_file_dir = "/home/pi/python_svn/mysql_insert-logs/%s/" %(date)
        mysql_insert_log_file_dir_lock.acquire()
        if not os.path.isdir(log_file_dir):
            os.makedirs(log_file_dir)
        mysql_insert_log_file_dir_lock.release()

        log_file_name = "%s%s.txt" %(log_file_dir, threading.current_thread().name)
        with open(log_file_name, "a", encoding="utf-8") as f:
            fcntl.flock(f, fcntl.LOCK_EX)
            date_time = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())
            f.write("[%s] %s\n%s\n\n" %(date_time, sql, err))
            fcntl.flock(f, fcntl.LOCK_UN)

        mysql_conn = None
        try:
            mysql_conn = mysql_pool.connection()
            mysql_conn.ping(reconnect=True)
            with mysql_conn.cursor() as cursor:
                cursor.execute(sql)
            mysql_conn.commit()
        except Exception as e:
            # mysql_conn is still None if the pool could not hand out a connection.
            if mysql_conn is not None:
                mysql_conn.rollback()
            md5_sql = hashlib.md5(sql.encode(encoding="UTF-8")).hexdigest()
            # Re-queue the failed statement, at most three times per distinct SQL string.
            with mysql_insert_error_dict_lock:
                if md5_sql not in mysql_insert_error_dict.keys():
                    mysql_insert_error_dict[md5_sql] = 1
                    mysql_insert_error_queue.put("%s#%s" %(sql, e))
                elif mysql_insert_error_dict[md5_sql] <= 2:
                    mysql_insert_error_dict[md5_sql] += 1
                    mysql_insert_error_queue.put("%s#%s" %(sql, e))
            print("%s: %s\n" %(threading.current_thread().name, e))
        else:
            print("%s: OK\n" %(threading.current_thread().name))
        finally:
            if mysql_conn is not None:
                mysql_conn.close()

        mysql_insert_error_queue.task_done()


def process_image_lst(paper_name, date, detail_lst, sql_data, image_url):
    image_file_dir = "%s%s/IMG/%s/%s%s%s/" %(pdf_dir, paper_name, date.split("-")[0], date.split("-")[0], date.split("-")[1], date.split("-")[2])

    if not os.path.isdir(image_file_dir):
        os.makedirs(image_file_dir)
        
    image_lst = []

    sql_data += ", '" 
    for i in range(0, len(detail_lst)):
        sql_image = ""
        if type(detail_lst[i]) is tuple:
            image_name = detail_lst[i][0]
            image_text = my_escape_string(detail_lst[i][1])

            url = image_url %(image_name)
            image_file_name = "%s%s_%s" %(image_file_dir, url.split("/")[-2].replace("-", "").replace(".", ""), url.split("/")[-1])
            sql_image = "%s|%s" %(image_file_name, image_text)

            if os.path.isfile(image_file_name) and os.path.getsize(image_file_name):
                print("download %s %s OK, File Exists\n" %(image_file_name, url))
            else:
                lst = [image_file_name, url]
                image_lst.append(lst)

        sql_data += "%s" %(sql_image)
        if i != (len(detail_lst) - 1):
            sql_data += "; "

    sql_data += "'"

    if image_lst:
        for image in image_lst:
            download_image_queue.put(image)

    return sql_data 


def mysql_insert(sql_data):
    mysql_insert_queue.put(sql_data)


def get_data_start_end(html, start_tag, end_tag, reg_str):
    start = False 
    match_lst = []
    for line in html:
        if line.find(start_tag) != -1:
            start = True 
            continue 

        if start:
            if line.find(end_tag) != -1:
                break 
            
            m = re.findall(reg_str, line)
            if m:
                for match in m:
                    match_lst.append(match)

    return match_lst


def get_data_line(html, start_tag, reg_str):
    match_lst = []
    for line in html:
        if line.find(start_tag) != -1:
            m = re.findall(reg_str, line)
            if m:
                for match in m:
                    match_lst.append(match)
            break 

    return match_lst 


def get_multi_data_start_end(html, start_tag, end_tag, reg_str_lst):
    start = False 
    strip_line = ""
    for line in html:
        if line.find(start_tag) != -1:
            start = True 
            continue 

        if start:
            if line.find(end_tag) != -1:
                break 
            
            strip_line += line.strip()

    match_lst = []
    for reg_str in reg_str_lst:
        m = re.findall(reg_str, strip_line)
        if m:
            for match in m:
                match_lst.append(match)
        else:
            match_lst.append("")

    return match_lst 


def get_rmrb(paper_name, date):
    index_url = "http://paper.people.com.cn/rmrb/html/%s-%s/%s/nbs.D110000renmrb_01.htm" %(date.split("-")[0], date.split("-")[1], date.split("-")[2])
    my_print(paper_name, date, "start get %s" %(index_url))

    headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36'}
    req = request.Request(url=index_url, headers=headers)
    layout_name_url_lst = []
    layout_article_lst = []
    flag_index = True 
    flag_other = True 
    try:
        with request.urlopen(req) as f:
            data = f.read().decode("utf-8", "ignore").split("\r\n")
            layout_name_lst = get_data_start_end(data, "<div id=\"pageList\" style=\"overflow:hidden;height:440px;\">", "<iframe id=\"postIframe\"", r"\.htm>(.+?)</a></div>")
            layout_url_lst = get_data_start_end(data, "<div id=\"pageList\" style=\"overflow:hidden;height:440px;\">", "<iframe id=\"postIframe\"", r"<a id=pageLink href=(.+?)\.htm>")
            for i in range(0, len(layout_name_lst)):
                layout_url = "http://paper.people.com.cn/rmrb/html/%s-%s/%s/%s.htm" %(date.split("-")[0], date.split("-")[1], date.split("-")[2], layout_url_lst[i].replace("./", ""))
                layout_name = layout_name_lst[i]
                layout_name_url = {
                        "name": layout_name, "url": layout_url
                        }
                layout_name_url_lst.append(layout_name_url)
    except Exception as e:
        flag_index = False 
        layout_name_url_lst = []
        my_print(paper_name, date, "get %s ERROR\n%s" %(index_url, e))

    for layout_name_url in layout_name_url_lst:
        layout_url = layout_name_url["url"]
        layout_name = layout_name_url["name"]
        my_print(paper_name, date, "get %s %s" %(layout_name, layout_url))

        req = request.Request(url=layout_url, headers=headers)
        articles = []
        try:
            with request.urlopen(req) as f:
                data = f.read().decode("utf-8", "ignore").split("\r\n")
                article_urls = get_data_start_end(data, "<div id=\"titleList\"  style=\"height:440px;overflow:hidden;\">", "</div>", r"<a href=(.+?)\.htm><script>")
                for article_url in article_urls:
                    url = "http://paper.people.com.cn/rmrb/html/%s-%s/%s/%s.htm" %(date.split("-")[0], date.split("-")[1], date.split("-")[2], article_url)
                    articles.append(url)
                layout_article = {
                        "layout": layout_name, "articles": articles 
                        }
                layout_article_lst.append(layout_article)
                my_print(paper_name, date, "get %s %s OK" %(layout_name, layout_url))
        except Exception as e:
            flag_other = False 
            my_print(paper_name, date, "get %s %s ERROR\n%s" %(layout_name, layout_url, e))

    count = 1
    for layout_article in layout_article_lst:
        articles = layout_article["articles"]
        for article_url in articles:
            my_print(paper_name, date, "get %s" %(article_url))

            req = request.Request(url=article_url, headers=headers)
            try:
                with request.urlopen(req) as f:
                    data = f.read().decode("utf-8", "ignore").split("\n")
                    title_lst = get_multi_data_start_end(data, "<div style=\"margin:15px auto 0 auto; width:550px;\">", "<div class=\"tool_t\">", [r"<h3>(.*?)</h3><h1>(.*?)</h1><h2>(.*?)</h2><h4>(.*?)</h4>"])
                    image_lst = get_multi_data_start_end(data, "<div style=\"margin:15px auto 0 auto; width:550px;\">", "<div class=\"tool_t\">", [r"<TR><TD align=\"center\"><IMG src=\"\.\./\.\./\.\./res(.+?)\.jpg\"></TD></TR> <TR><TD>(.*?)</TD></TR>"])
                    content_lst = get_multi_data_start_end(data, "<div style=\"margin:15px auto 0 auto; width:550px;\">", "<div class=\"tool_t\">", [r"<!--enpcontent-->(.*?)<!--/enpcontent-->"])

                    sql_data = "'%s', '%s', '%s', '%s', '%s', '%s', '%s'" %(paper_name, date, my_escape_string(layout_article["layout"]), my_escape_string(title_lst[0][0]), my_escape_string(title_lst[0][1]), my_escape_string(title_lst[0][2]), my_escape_string(title_lst[0][3]))
                    sql_data = process_image_lst(paper_name, date, image_lst, sql_data, "http://paper.people.com.cn/rmrb/res%s.jpg")
                    sql_data += ", '%s', '%s', %d" %(my_escape_string(content_lst[0]), article_url, count)
                    count += 1
                    mysql_insert(sql_data)
                    my_print(paper_name, date, "get %s OK" %(article_url))
            except Exception as e:
                flag_other = False 
                my_print(paper_name, date, "get %s ERROR\n%s" %(article_url, e))

    if flag_index and flag_other and layout_article_lst:
        if paper_name not in epapers_done_dict.get(date, []):
            epapers_done_dict.setdefault(date, []).append(paper_name)
        my_print(paper_name, date, "OK\n\n")
    else:
        my_print(paper_name, date, "NOT OK\n\n")


def get_zqcn(paper_name, date):
    index_url = "http://epaper.zqcn.com.cn/content/%s-%s/%s/node_2.htm" %(date.split("-")[0], date.split("-")[1], date.split("-")[2])
    my_print(paper_name, date, "start get %s" %(index_url))

    headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36'}
    req = request.Request(url=index_url, headers=headers)
    layout_name_url_lst = []
    layout_article_lst = []
    flag_index = True 
    flag_other = True 
    try:
        with request.urlopen(req) as f:
            data = f.read().decode("utf-8", "ignore").split("\r\n")
            layout_name_lst = get_data_line(data, "<td height=\"240\" valign=\"top\" class=\"mulu04\">", r"\.htm>(.+?)</a></td>")
            layout_url_lst = get_data_line(data, "<td height=\"240\" valign=\"top\" class=\"mulu04\">", r"<a id=pageLink href=(.+?)\.htm>")
            for i in range(0, len(layout_name_lst)):
                layout_url = "http://epaper.zqcn.com.cn/content/%s-%s/%s/%s.htm" %(date.split("-")[0], date.split("-")[1], date.split("-")[2], layout_url_lst[i].replace("./", ""))
                layout_name = layout_name_lst[i]
                layout_name_url = {
                        "name": layout_name, "url": layout_url 
                        }
                layout_name_url_lst.append(layout_name_url)
    except Exception as e:
        flag_index = False 
        layout_name_url_lst = []
        my_print(paper_name, date, "get %s ERROR\n%s" %(index_url, e))

    for layout_name_url in layout_name_url_lst:
        layout_url = layout_name_url["url"]
        layout_name = layout_name_url["name"]
        my_print(paper_name, date, "get %s %s" %(layout_name, layout_url))

        req = request.Request(url=layout_url, headers=headers)
        articles = []
        try:
            with request.urlopen(req) as f:
                data = f.read().decode("utf-8", "ignore").split("\r\n")
                article_urls = get_data_start_end(data, "<div class=\"gundong\">", "<td class=\"bm03\">&nbsp;</td>", r"<li><a href=(.+?)\.htm>")
                for article_url in article_urls:
                    url = "http://epaper.zqcn.com.cn/content/%s-%s/%s/%s.htm" %(date.split("-")[0], date.split("-")[1], date.split("-")[2], article_url)
                    articles.append(url)
                layout_article = {
                        "layout": layout_name, "articles": articles 
                        }
                layout_article_lst.append(layout_article)
                my_print(paper_name, date, "get %s %s OK" %(layout_name, layout_url))
        except Exception as e:
            flag_other = False 
            my_print(paper_name, date, "get %s %s ERROR\n%s" %(layout_name, layout_url, e))

    count = 1
    for layout_article in layout_article_lst:
        articles = layout_article["articles"]
        for article_url in articles:
            my_print(paper_name, date, "get %s" %(article_url))

            req = request.Request(url=article_url, headers=headers)
            try:
                with request.urlopen(req) as f:
                    data = f.read().decode("utf-8", "ignore").split("\n")
                    title_lst = get_multi_data_start_end(data, "<div class=\"title01\"", "</div>", [r"<h2>(.+?)</h2> <h1>", r"</h2> <h1>(.+?)</h1>", r"</h1> <h2>(.+?)</h2>", r"margin-top:12px;\">(.+?)</p>"])
                    image_lst = get_multi_data_start_end(data, "<div class=\"title02\"", "</div>", [r"/attachement(.+?)\.jpg\"></TD></TR><TR><TD>(.*?)</TD></TR></TBODY>"])
                    content_lst = get_multi_data_start_end(data, "<div class=\"title03\"", "<div style=\"clear:both;\">", [r"<div id=ozoom style=\"ZOOM: 100%\"><founder-content>(.+?)</founder-content></div>"])
                    sql_data = "'%s', '%s', '%s', '%s', '%s', '%s', '%s'" %(paper_name, date, my_escape_string(layout_article["layout"]), my_escape_string(title_lst[0]), my_escape_string(title_lst[1]), my_escape_string(title_lst[2]), my_escape_string(title_lst[3]))
                    sql_data = process_image_lst(paper_name, date, image_lst, sql_data, "http://epaper.zqcn.com.cn/attachement%s.jpg")
                    sql_data += ", '%s', '%s', %d" %(my_escape_string(content_lst[0]), article_url, count)
                    count += 1
                    mysql_insert(sql_data)
                    my_print(paper_name, date, "get %s OK" %(article_url))
            except Exception as e:
                flag_other = False 
                my_print(paper_name, date, "get %s ERROR\n%s" %(article_url, e))

    if flag_index and flag_other and layout_article_lst:
        if paper_name not in epapers_done_dict.get(date, []):
            epapers_done_dict.setdefault(date, []).append(paper_name)
        my_print(paper_name, date, "OK\n\n")
    else:
        my_print(paper_name, date, "NOT OK\n\n")


def get_cdrb(paper_name, date):
    index_url = "http://www.cdrb.com.cn/epaper/cdrbpc/%s%s/%s/l01.html" %(date.split("-")[0], date.split("-")[1], date.split("-")[2])
    my_print(paper_name, date, "start get %s" %(index_url))

    headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36'}
    req = request.Request(url=index_url, headers=headers)
    layout_name_url_lst = []
    layout_article_lst = []
    flag_index = True 
    flag_other = True 
    try:
        with request.urlopen(req) as f:
            data = f.read().decode("utf-8", "ignore").split("\r\n")
            layout_name_lst = get_data_start_end(data, "<div class=\"nav-list\" style=\"height:650px;overflow-y:scroll\">", "</div>", r"class=\"btn btn-block\">(.+?)<")
            layout_url_lst = get_data_start_end(data, "<div class=\"nav-list\" style=\"height:650px;overflow-y:scroll\">", "</div>", r"<a href=\"(.+?)\.html\" class=\"btn btn-block\">")
            for i in range(0, len(layout_name_lst)):
                layout_url = "http://www.cdrb.com.cn/epaper/cdrbpc/%s%s/%s/%s.html" %(date.split("-")[0], date.split("-")[1], date.split("-")[2], layout_url_lst[i])
                layout_name = layout_name_lst[i]
                layout_name_url = {
                        "name": layout_name, "url": layout_url 
                        }
                layout_name_url_lst.append(layout_name_url)
    except Exception as e:
        flag_index = False 
        layout_name_url_lst = []
        my_print(paper_name, date, "get %s ERROR\n%s" %(index_url, e))

    for layout_name_url in layout_name_url_lst:
        layout_url = layout_name_url["url"]
        layout_name = layout_name_url["name"]
        my_print(paper_name, date, "get %s %s" %(layout_name, layout_url))

        req = request.Request(url=layout_url, headers=headers)
        articles = []
        try:
            with request.urlopen(req) as f:
                data = f.read().decode("utf-8", "ignore").split("\r\n")
                article_urls = get_data_start_end(data, "<div class=\"news-list\" style=\"height: 560px;overflow-y: auto;\">", "</div>", r"<a href=\"(.+?)\.html\">")
                for article_url in article_urls:
                    url = "%s.html" %(article_url)
                    articles.append(url)
                layout_article = {
                        "layout": layout_name, "articles": articles 
                        }
                layout_article_lst.append(layout_article)
                my_print(paper_name, date, "get %s %s OK" %(layout_name, layout_url))
        except Exception as e:
            flag_other = False 
            my_print(paper_name, date, "get %s %s ERROR\n%s" %(layout_name, layout_url, e))

    count = 1
    for layout_article in layout_article_lst:
        articles = layout_article["articles"]
        for article_url in articles:
            my_print(paper_name, date, "get %s" %(article_url))

            req = request.Request(url=article_url, headers=headers)
            try:
                with request.urlopen(req) as f:
                    data = f.read().decode("utf-8", "ignore").split("\n")
                    title_lst = get_multi_data_start_end(data, "<div class=\"detail-art\">", "</founder-content>", [r"<p class=\"introtitle text-center\" id=\"PreTitle\">(.*?)</p><h2 id=\"Title\" class=\"art-title text-center\">(.*?)</h2><p class=\"subtitle text-center\" id=\"SubTitle\">(.*?)</p>"])
                    image_lst = get_multi_data_start_end(data, "<div class=\"detail-art\">", "</founder-content>", [r"<img src=\"(.+?)\.jpg\.1\" width=\"550px\"><p>(.*?)</p>"])
                    content_lst = get_multi_data_start_end(data, "<div class=\"detail-art\">", "</founder-content>", [r"<!--enpcontent-->(.*?)<!--/enpcontent-->"])
                    sql_data = "'%s', '%s', '%s', '%s', '%s', '%s', '%s'" %(paper_name, date, my_escape_string(layout_article["layout"]), my_escape_string(title_lst[0][0]), my_escape_string(title_lst[0][1]), my_escape_string(title_lst[0][2]), '')
                    sql_data = process_image_lst(paper_name, date, image_lst, sql_data, "%s.jpg")
                    sql_data += ", '%s', '%s', %d" %(my_escape_string(content_lst[0]), article_url, count)
                    count += 1
                    mysql_insert(sql_data)
                    my_print(paper_name, date, "get %s OK" %(article_url))
            except Exception as e:
                flag_other = False 
                my_print(paper_name, date, "get %s ERROR\n%s" %(article_url, e))

    if flag_index and flag_other and layout_article_lst:
        if paper_name not in epapers_done_dict.get(date, []):
            epapers_done_dict.setdefault(date, []).append(paper_name)
        my_print(paper_name, date, "OK\n\n")
    else:
        my_print(paper_name, date, "NOT OK\n\n")


def check_status():
    date = time.strftime("%Y-%m-%d", time.localtime())
    log_file_dir = "/home/pi/python_svn/%s%s-logs/" %(date.split("-")[0], date.split("-")[1])
    if not os.path.isdir(log_file_dir):
        os.makedirs(log_file_dir)

    log_file_name = "%slogs_%s.txt" %(log_file_dir, date.split("-")[2])
    with open(log_file_name, "a", encoding="utf-8") as f:
        fcntl.flock(f, fcntl.LOCK_EX)
        date_time = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())
        f.write("[%s] epapers_done_dict_v2\n" %(date_time))
        for key, value in epapers_done_dict.items():
            done_lst = []
            todo_lst = []
            for key2 in epapers.keys():
                if key2 not in value:
                    todo_lst.append(key2)
                else:
                    done_lst.append(key2)

            f.write("%s Done(%d): %s\n" %(key, len(done_lst), ", ".join(done_lst)))
            f.write("%s Todo(%d): %s\n" %(key, len(todo_lst), ", ".join(todo_lst)))
            
            print("%s Done(%d): %s\n" %(key, len(done_lst), ", ".join(done_lst)))
            print("%s Todo(%d): %s\n" %(key, len(todo_lst), ", ".join(todo_lst))) 

        for t in download_image_thread_lst:
            f.write("%s: %s\n" %(t.name, t.is_alive()))
            print("%s: %s\n" %(t.name, t.is_alive()))

        for t in mysql_insert_thread_lst:
            f.write("%s: %s\n" %(t.name, t.is_alive()))
            print("%s: %s\n" %(t.name, t.is_alive()))

        for t in download_image_error_thread_lst:
            f.write("%s: %s\n" %(t.name, t.is_alive()))
            print("%s: %s\n" %(t.name, t.is_alive()))

        for t in mysql_insert_error_thread_lst:
            f.write("%s: %s\n" %(t.name, t.is_alive()))
            print("%s: %s\n" %(t.name, t.is_alive()))

        f.write("\n\n")
        fcntl.flock(f, fcntl.LOCK_UN)


def clean_data():
    global epapers_done_dict 
    global download_image_error_dict 
    global mysql_insert_error_dict 
    epapers_done_dict = {}
    download_image_error_dict = {}
    mysql_insert_error_dict = {}


def get_epapers_date(date):
    for key, value in epapers.items():
        if key not in epapers_done_dict.get(date, []):
            t = threading.Thread(target=value, args=(key, date))
            t.start()
            t.join()

            second = random.randint(5, 10)
            time.sleep(second)

    check_status()


def get_epapers_today():
    today = time.strftime("%Y-%m-%d", time.localtime())
    get_epapers_date(today)


def is_network_connect():
    print("Get Network Connection Status")
    headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36'}
    req = request.Request(url="https://www.baidu.com", headers=headers)
    try:
        request.urlopen(req)
        print("Network Connection Status: OK")
        return True 
    except Exception:
        print("Network Connection Status: ERROR")
        return False 


def start_thread():
    for i in range(0, 20):
        t = threading.Thread(target=download_image_thread, args=(), name="download_image_thread_%s" %(i))
        download_image_thread_lst.append(t)
        t.start()
    
    for i in range(0, 10):
        t = threading.Thread(target=mysql_insert_thread, args=(), name="mysql_insert_thread_%s" %(i))
        mysql_insert_thread_lst.append(t)
        t.start()
    
    for i in range(0, 1):
        t = threading.Thread(target=download_image_error_thread, args=(), name="download_image_error_thread_%s" %(i))
        download_image_error_thread_lst.append(t)
        t.start()
    
    for i in range(0, 1):
        t = threading.Thread(target=mysql_insert_error_thread, args=(), name="mysql_insert_error_thread_%s" %(i))
        mysql_insert_error_thread_lst.append(t)
        t.start()


epapers = {
        "人民日报": get_rmrb, "中国企业报": get_zqcn, "成都日报": get_cdrb,
        }


if __name__ == "__main__":
    if is_network_connect():
        start_thread()
        background = True 
        #background = False 
        if background:
            time_lst = ["10:00", "14:00", "18:00", "22:00"]
            delta = datetime.timedelta(minutes=-5)
            for get_time in time_lst:
                dt = datetime.datetime.strptime(get_time, "%H:%M") + delta 
                check_time = dt.strftime("%H:%M")
                schedule.every().day.at(check_time).do(check_status)
                schedule.every().day.at(get_time).do(get_epapers_today)

            schedule.every().day.at("03:00").do(clean_data)
            
            print("Schedule Jobs: ")
            for job in schedule.jobs:
                print("%s()" %(str(job).split("()")[0]))

            while True:
                schedule.run_pending()
                time.sleep(1)
        else:
            get_epapers_today()
    else:
        print("---Network Unconnected")
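
The insert workers above assemble each SQL statement with string formatting and rely on my_escape_string() for quoting. A minimal sketch of the alternative, parameter binding, follows. It uses the standard-library sqlite3 module with an in-memory table purely as a stand-in so that it runs without a database server; the table t_demo, its three columns and the helper insert_one are hypothetical. With pymysql the same pattern would pass the values as the second argument of cursor.execute() and use %s placeholders instead of ?.

```python
import queue
import sqlite3

# Hypothetical three-column table standing in for t_epaper.
conn = sqlite3.connect(":memory:")
conn.execute("create table t_demo (title text, content text, seq_num integer)")

# Queue tuples instead of a pre-joined string, so a field may contain
# quotes, commas or semicolons without any escaping or splitting.
insert_queue = queue.Queue()
insert_queue.put(("It's a 'title', really", 'line one; "line two"', 1))


def insert_one(connection, row):
    # The driver binds the values; the SQL text never contains field data.
    # "insert or ignore" is SQLite syntax; MariaDB spells it "insert ignore".
    connection.execute("insert or ignore into t_demo values (?, ?, ?)", row)
    connection.commit()


row = insert_queue.get()
insert_one(conn, row)
insert_queue.task_done()

stored = conn.execute("select title, content, seq_num from t_demo").fetchall()
print(stored)
```

Passing tuples through the queue also removes the need to re-parse the record in the worker thread, at the cost of changing the queue's message format.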

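
Both error paths in the script use the same bookkeeping: hash the failed item, count its failures in a dict under a lock, and re-queue it only while the count allows. The sketch below isolates that pattern in one small helper. The class name RetryBudget and its method should_retry are hypothetical and not part of the original script; the limit of three retries mirrors the `<= 2` check used there.

```python
import hashlib
import threading


class RetryBudget:
    """Thread-safe counter that grants each distinct item a fixed number of retries."""

    def __init__(self, max_retries=3):
        self._max_retries = max_retries
        self._counts = {}
        self._lock = threading.Lock()

    def should_retry(self, item):
        # Hash the item so long SQL strings or URLs make compact dict keys.
        key = hashlib.md5(item.encode("utf-8")).hexdigest()
        with self._lock:
            used = self._counts.get(key, 0)
            if used >= self._max_retries:
                return False
            self._counts[key] = used + 1
            return True

    def reset(self):
        # Counterpart of clean_data(): forget all failure counts.
        with self._lock:
            self._counts.clear()


budget = RetryBudget()
results = [budget.should_retry("some failed item") for _ in range(5)]
print(results)
```

A worker would call should_retry() in its except branch and put the item back on the error queue only when it returns True.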
 
