Using regular expressions to extract content from a docx file converted to txt

liuhua3000

Tags: regular expressions, latex, JavaScript

Article link: https://blog.csdn.net/liuhua3000/article/details/73741844

First, convert the docx file with pandoc.

pandoc -f docx -t latex -o t33.txt testAp.docx
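If the conversion needs to run from a script rather than by hand, the same command can be launched from Node.js. The following is only a sketch: the helper names pandocArgs and convertDocx are hypothetical, and it assumes the pandoc executable is available on the PATH.

```javascript
// Sketch: run the same pandoc conversion from a Node.js script.
const { execFileSync } = require("child_process");

// Build the argument list for: pandoc -f docx -t latex -o <output> <input>
function pandocArgs(input, output) {
  return ["-f", "docx", "-t", "latex", "-o", output, input];
}

// Hypothetical helper: assumes the `pandoc` executable is on the PATH.
function convertDocx(input, output) {
  execFileSync("pandoc", pandocArgs(input, output));
}

// Example call (commented out because it needs pandoc and the input file):
// convertDocx("testAp.docx", "t33.txt");
```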

The resulting file (LaTeX markup saved with a .txt extension) looks like this:

\section{Question1}\label{question1}

\subsection{问题}\label{ux95eeux9898}

\begin{quote}
The random variable \(X\) has the probability distribution shown in the
table.
\end{quote}

\begin{longtable}[]{@{}llll@{}}
\toprule
\(x\) & 2 & 4 & 6\tabularnewline
\midrule
\endhead
\(P\left( X = x \right)\) & 0.5 & 0.4 & 0.1\tabularnewline
\bottomrule
\end{longtable}

Two \emph{independent} values of \(X\) are chosen at random. The random
variable \(Y\) takes the value s of \(X\) are the same. Otherwise the
value of \(Y\) is the larger value of \(X\) minus the smaller value of
\(X\).

(i) Draw up the probability distribution table for \(Y\).

(ii) Find the expected value of \(Y\).

\subsection{关键字}\label{ux5173ux952eux5b57}

\begin{quote}
独立 可能性 期待
\end{quote}

\subsection{翻译}\label{ux7ffbux8bd1}

\begin{quote}
随机变量X 可能的分布情况如下所示。
\end{quote}

\subsection{选项}\label{ux9009ux9879}

\begin{enumerate}
\def\labelenumi{\Alph{enumi}.}
\item
  21+ \(f(x)\)
\item
  22
\item
  23
\end{enumerate}

\subsection{答案}\label{ux7b54ux6848}

A

\subsection{提示}\label{ux63d0ux793a}

参考统计学的平均数公式。

期望值公式。

\subsection{解析}\label{ux89e3ux6790}

(i)

\begin{longtable}[]{@{}llll@{}}
\toprule
\(r\) & 0 & 2 & 4\tabularnewline
\midrule
\endhead
\(P(Y = r)\) & 0.42 & 0.48 & 0.1\tabularnewline
\bottomrule
\end{longtable}

\(P\left( Y = 0 \right) = 0.5 \times 0.5 + 0.4 \times 0.4 + 0.1 \times 0.1 = 0.42\)
\textbackslash{}\textbackslash{}

\(P\left( Y = 2 \right) = 0.5 \times 0.4 \times 2 + 0.4 \times 0.1 \times 2 = 0.48\)

\(P\left( Y = 4 \right) = 0.5 \times 0.1 \times 2 = 0.1\)

(ii) \(\left( Y \right) = 0 \times 0.42 + 2 \times 0.48 + 4 \times 0.1\)
\textbackslash{}\textbackslash{}

\(= 1.36\)

\subsection{结束}\label{ux7ed3ux675f}

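I agreed on some formatting conventions here so that each part of a question can be identified. The regular expressions I ended up using for extraction are: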

// Each pattern captures, in group 1, the text between one \subsection heading and the next.
var regx = /\\subsection{问题}\\label.*\n?((\s|\S)*?)\\subsection{关键字}/;
var regxKey = /\\subsection{关键字}\\label.*\n?((\s|\S)*?)\\subsection{翻译}/;
var regxTrans = /\\subsection{翻译}\\label.*\n?((\s|\S)*?)\\subsection{选项}/;
var regxChoice = /\\subsection{选项}\\label.*\n?((\s|\S)*?)\\subsection{答案}/;
var regxAnswer = /\\subsection{答案}\\label.*\n?((\s|\S)*?)\\subsection{提示}/;
var regxHint = /\\subsection{提示}\\label.*\n?((\s|\S)*?)\\subsection{解析}/;
var regxStep = /\\subsection{解析}\\label.*\n?((\s|\S)*?)\\subsection{结束}/;
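Since the seven patterns differ only in their start and end headings, they could also be generated from a list. The sketch below is one possible way to do that, not the code I actually used: the names headings, betweenHeadings and extractSections are made up for illustration, and it swaps (\s|\S) for the equivalent character class [\s\S].

```javascript
// Section headings in the order they appear in the converted file.
const headings = ["问题", "关键字", "翻译", "选项", "答案", "提示", "解析", "结束"];

// Build a regex that captures the text between two \subsection headings.
// [\s\S] is used as an "any character, including line breaks" class.
function betweenHeadings(start, end) {
  return new RegExp(
    "\\\\subsection\\{" + start + "\\}\\\\label.*\\r?\\n?" +
      "([\\s\\S]*?)\\\\subsection\\{" + end + "\\}"
  );
}

// Hypothetical helper: maps each heading to its trimmed body,
// or to null when the heading pair is not found in the text.
function extractSections(text) {
  const result = {};
  for (let i = 0; i < headings.length - 1; i++) {
    const match = betweenHeadings(headings[i], headings[i + 1]).exec(text);
    result[headings[i]] = match ? match[1].trim() : null;
  }
  return result;
}

// Usage with a small fragment of the converted file (Windows line endings).
const sample =
  "\\subsection{答案}\\label{ux7b54ux6848}\r\n\r\nA\r\n\r\n" +
  "\\subsection{提示}\\label{ux63d0ux793a}\r\n";
console.log(extractSections(sample)["答案"]); // A
```

Trimming the captured text removes the blank lines that pandoc leaves around each section body.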

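Here I would normally use (.|\n)*? to match any character, but for some reason it failed on this file, so I ended up using \s|\S instead. This puzzled me, and I do not know whether XRegExp's any-character feature behaves the same way.
One last point: the txt file was produced on Windows, so its lines end with \r\n. I did not adjust my thinking to that, and it cost me a lot of time.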
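These two observations may well be connected. In JavaScript, the dot does not match line terminators, and \r is one of them, so (.|\n) has no way to consume the \r in a \r\n pair. The small experiment below illustrates this under the assumption that the input really uses \r\n; the variable names and the test string are made up for the demonstration.

```javascript
const crlfText = "start\r\nmiddle\r\nend";

// `.` never matches \r or \n, and the alternative only allows \n,
// so the \r in each \r\n pair cannot be consumed.
const dotOrNewline = /start((.|\n)*?)end/;

// [\s\S] matches every character, including \r and \n.
const anyChar = /start([\s\S]*?)end/;

console.log(dotOrNewline.test(crlfText)); // false
console.log(anyChar.test(crlfText));      // true

// Alternative: normalise line endings first, after which (.|\n) works again.
const lfText = crlfText.replace(/\r\n/g, "\n");
console.log(dotOrNewline.test(lfText));   // true
```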
In the txt file, a blank line marks the end of a paragraph, whereas I need paragraphs wrapped in <p></p>. So the text has to be reformatted.

str = str.replace(/(\S)\r\n(\S)/g, "$1 $2");

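This line replaces a single line break with a space when there is non-whitespace text on both sides of it, while two consecutive line breaks (a blank line) are left unchanged.
I had originally planned to split the string and join the pieces back together, but replace now looks like the more reliable approach.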
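A possible continuation of this idea is sketched below; it is not the code I actually ran, and the names joinWrappedLines and toParagraphs are hypothetical. It uses lookarounds instead of capture groups, because the capturing version consumes the character after the line break, so in a run of one-character lines such as "A\r\nB\r\nC" the second break is skipped. It assumes a JavaScript engine with lookbehind support.

```javascript
// Replace a single line break between two non-whitespace characters with a space.
// Lookarounds leave the neighbouring characters unconsumed, so consecutive
// one-character lines are handled as well.
function joinWrappedLines(str) {
  return str.replace(/(?<=\S)\r?\n(?=\S)/g, " ");
}

// Wrap each blank-line-separated paragraph in <p></p>.
function toParagraphs(str) {
  return joinWrappedLines(str)
    .split(/(?:\r?\n){2,}/)           // two or more line breaks end a paragraph
    .map((p) => p.trim())
    .filter((p) => p.length > 0)      // drop empty pieces at the edges
    .map((p) => "<p>" + p + "</p>")
    .join("\n");
}

const wrapped =
  "The random variable X has the\r\ntable.\r\n\r\nTwo values\r\nare chosen.";
console.log(toParagraphs(wrapped));
// <p>The random variable X has the table.</p>
// <p>Two values are chosen.</p>
```

The final split only runs after the single line breaks have been joined, so replace still does the delicate part of the work.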

Replacing tables
