Working around the missing -P option in macOS grep, and matching UTF-8 Chinese characters with grep

youwen21

Category: command line / terminal. Tags: mac, grep, matching UTF-8 Chinese with grep

Article link: https://blog.csdn.net/youwen21/article/details/100514013

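Contents

Goal: find all Chinese text in UTF-8 files
Differences between the grep shipped with each system
- macOS
- CentOS
Lessons learned
- How can macOS search for all Chinese text in a file?
Notes on installing GNU grep on macOS
How can grep match every Chinese character in a text?
How grep matches UTF-8 Chinese
Extra: Linux commands for viewing binary and hexadecimal
References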

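Goal: find all Chinese text in UTF-8 files

When working on a project's code, you sometimes need to find every file in the project that contains Chinese. Sublime Text or another text editor can do this easily.
This article uses grep instead to match the Chinese characters in UTF-8 files.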

Differences between the grep shipped with each system

macOS

/usr/local/var » grep --version
grep (BSD grep) 2.5.1-FreeBSD

CentOS

[root@iZZ Command]# grep --version
grep (GNU grep) 2.20
Copyright (C) 2014 Free Software Foundation, Inc.
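License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by Mike Haertel and others, see <http://git.sv.gnu.org/cgit/grep.git/tree/AUTHORS>.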

Lessons learned

The grep that ships with macOS is BSD grep, which does not support the -P option.
Only GNU grep supports -P.

What -P stands for: PCRE ("Perl-compatible regular expressions").
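
Before relying on -P in a script, it can help to probe whether the grep on PATH accepts it. The following is a minimal sketch; the function name supports_pcre is made up for this example, and it only checks that a trivial -P match succeeds.

```shell
# supports_pcre CMD: succeed if CMD accepts the -P option.
# (supports_pcre is a hypothetical helper name, not part of grep.)
supports_pcre() {
    # A grep without -P support exits with an error here;
    # a grep with PCRE support matches the line and exits 0.
    echo 'a' | "$1" -P 'a' >/dev/null 2>&1
}

if supports_pcre grep; then
    echo "grep supports -P"
else
    echo "grep does not support -P"
fi
```

On a stock macOS setup like the one shown above, this would be expected to take the second branch.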

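How can macOS search for all Chinese text in a file?

Option 1:

brew install grep, and use it in place of the stock grep

Option 2:

brew install pcre, and use pcregrep instead of grep

What to watch out for when installing pcre:
pcre must be built with UTF-8 support, otherwise the output will be garbled.

With the grep bundled with macOS, do not bother trying hexadecimal byte patterns to match all Chinese characters; it costs time and effort and does not get the job done. Switching to GNU grep is the right fix.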

Notes on installing GNU grep on macOS

brew install grep
Edit PATH in your profile
Run source .profile so the change takes effect in the current terminal, or open a new terminal

brew install grep

All commands have been installed with the prefix "g".
If you need to use these commands with their normal names, you
can add a "gnubin" directory to your PATH from your bashrc like:
  PATH="/usr/local/opt/grep/libexec/gnubin:$PATH"
==> Summary
...
------------------------------------------------------------
» ls /usr/local/opt/grep/libexec/gnubin    
egrep fgrep grep
------------------------------------------------------------
 » /usr/local/opt/grep/libexec/gnubin/grep --version 
grep (GNU grep) 3.3
...
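
The PATH edit mentioned in the notes can be sketched as below, for example as lines appended to ~/.profile. The gnubin path is the one printed by brew in the output above; prepend_path is a hypothetical helper name, and the guard against duplicate entries is an addition of this sketch.

```shell
# prepend_path DIR: put DIR at the front of PATH unless it is already listed.
# (prepend_path is a hypothetical helper name.)
prepend_path() {
    case ":$PATH:" in
        *":$1:"*) ;;              # already on PATH: leave it unchanged
        *) PATH="$1:$PATH" ;;
    esac
}

# gnubin directory as printed by brew in the output above
prepend_path "/usr/local/opt/grep/libexec/gnubin"
export PATH
```

If Homebrew is installed under a different prefix, the directory will differ; `$(brew --prefix grep)/libexec/gnubin` is probably the more portable spelling.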

How can grep match every Chinese character in a text?

This assumes you are using GNU grep, not BSD grep.

grep --color='auto' -P -n "[\x80-\xFF]" file.xml
or, as the complement of the ASCII range:
grep --color='auto' -P -n "[^\x00-\x7F]" file.xml
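
For the original goal of finding every file in a project that contains Chinese, the second pattern can be combined with a recursive search. This is a sketch that assumes a grep with -P support; list_non_ascii_files is a name invented here. Note that the pattern matches any non-ASCII character, not only Chinese ones.

```shell
# list_non_ascii_files DIR: print the files under DIR that contain a non-ASCII character.
# Assumes a grep that supports -P (for example GNU grep built with PCRE).
# (list_non_ascii_files is a hypothetical helper name.)
list_non_ascii_files() {
    # -r recurse into DIR, -l print only file names, -P use a Perl-compatible regex
    grep -rlP '[^\x00-\x7F]' "$1"
}

# Example (hypothetical directory name):
#   list_non_ascii_files ./src
```

If only Han characters are wanted, PCRE's `\p{Han}` script property may be an option in a UTF-8 locale, assuming the PCRE library was built with Unicode property support.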

How grep matches UTF-8 Chinese

Suppose we have the following UTF-8 text:

$str = '123abCD你和我';

The bytes of this string, in binary, are:

00110001  =  1
00110010  =  2
00110011  =  3
01100001  =  a
01100010  =  b
01000011  =  C
01000100  =  D
11100100 10111101 10100000 = 你
11100101 10010010 10001100 = 和
11100110 10001000 10010001 = 我
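
The byte values listed above can be reproduced from a shell. The sketch below prints them in hexadecimal rather than binary; hex_bytes is a name invented for this example, and it assumes the script or terminal is itself UTF-8 encoded.

```shell
# hex_bytes STRING: print the bytes of STRING as space-separated hexadecimal.
# (hex_bytes is a hypothetical helper name.)
hex_bytes() {
    # od: -An drops the offset column, -tx1 prints each byte as two hex digits,
    # -v prints every line even if it repeats the previous one.
    # tr then squeezes spaces and newlines into single spaces.
    printf '%s' "$1" | od -An -tx1 -v | tr -s ' \n' ' '
}

hex_bytes '123abCD你和我'
```

In the output, e4 bd a0 corresponds to 11100100 10111101 10100000, the three bytes of 你.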

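grep here is matching on byte values. In UTF-8, bytes from 00000000 to 01111111 are still plain ASCII: the usual 0-9, a-z, A-Z, punctuation and other special characters, and non-printable control characters.
UTF-8 represents characters as follows:
0xxxxxxx // original ASCII characters
110xxxxx 10xxxxxx // non-ASCII characters
1110xxxx 10xxxxxx 10xxxxxx // non-ASCII characters
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx // non-ASCII characters

If the first byte starts with "110", it is followed by one byte starting with "10";
if the first byte starts with "1110", it is followed by two bytes starting with "10";
and so on.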
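
The leading-bit rules above can be sketched as a small classifier. It looks only at the bit pattern of a single byte and does not validate whole sequences; utf8_byte_kind and the labels it prints are invented for this example.

```shell
# utf8_byte_kind N: classify a byte value N (decimal, 0-255) by its leading bits.
# (utf8_byte_kind and its labels are hypothetical, for illustration only.)
utf8_byte_kind() {
    if   [ "$1" -lt 128 ]; then echo "ascii"           # 0xxxxxxx
    elif [ "$1" -lt 192 ]; then echo "continuation"    # 10xxxxxx
    elif [ "$1" -lt 224 ]; then echo "lead-of-2"       # 110xxxxx
    elif [ "$1" -lt 240 ]; then echo "lead-of-3"       # 1110xxxx
    elif [ "$1" -lt 248 ]; then echo "lead-of-4"       # 11110xxx
    else                        echo "invalid"         # 11111xxx is not used by UTF-8
    fi
}

# The three bytes of 你 (11100100 10111101 10100000) are 228 189 160 in decimal.
for b in 228 189 160; do
    utf8_byte_kind "$b"
done
```

For 你 this prints lead-of-3 followed by two continuation lines. Every byte of a non-ASCII character is 128 or above, which is why a pattern that excludes \x00-\x7F is enough to find them.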

So: