Today, while processing Alibaba Cloud order records, I ran into the following scenario: cat eee.txt
202087290320221
增量带宽
订单
2018-06-03 22:14:21
¥222.02
¥222.02

202085390830221
云服务器ECS(包月)
订单
2018-06-03 22:07:01
¥2552.00
¥2552.00

202088490000221
云服务器ECS(包月)
订单
2018-06-03 22:04:39
¥556.00
¥556.00
After processing, it should end up in the following format:
202087290320221|增量带宽|订单|2018-06-03 22:14:21|¥222.02|¥222.02
202085390830221|云服务器ECS(包月)|订单|2018-06-03 22:07:01|¥2552.00|¥2552.00
202088490000221|云服务器ECS(包月)|订单|2018-06-03 22:04:39|¥556.00|¥556.00
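The data above is quite regular: each block contains six lines, followed by one blank line, then another six-line block, and so on.
With awk, all we need to do is change the record separator RS. By default RS is "\n"; if we set RS="" instead, awk uses blank lines to separate records, so each block becomes a single record.
The meaning of RS="", alongside the other possible values of RS, is as follows: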
RS == "\n"
Records are separated by the newline character ("\n"). In effect, every line in the data file is a separate record, including blank lines. This is the default.
RS == any single character
Records are separated by each occurrence of the character. Multiple successive occurrences delimit empty records.
RS == ""
Records are separated by runs of blank lines. The newline character always serves as a field separator, in addition to whatever value FS may have. Leading and trailing newlines in a file are ignored.
RS == regexp
Records are separated by occurrences of characters that match regexp. Leading and trailing matches of regexp delimit empty records. (This is a gawk extension; it is not specified by the POSIX standard.)
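To make the difference between the default RS and RS="" concrete, here is a minimal sketch. The input is a made-up sample (not the order data), and the `sample` helper is a hypothetical name used only for this illustration. It has extra blank lines at the start, between the blocks, and at the end, to show that runs of blank lines count as one separator and that leading and trailing newlines are ignored.

```shell
# Hypothetical sample input: two blocks ("a b" and "c d e"), with blank
# lines before, between, and after them.
sample() { printf '\n\na\nb\n\n\n\nc\nd\ne\n\n'; }

# Default RS ("\n"): every line is a record, including the blank ones.
default_records=$(sample | awk 'END { print NR }')

# RS="": each run of blank lines separates records, and newlines also
# act as field separators, so NF is the number of lines in the block here.
paragraph_records=$(sample | awk 'BEGIN { RS = "" } { print NR ": " NF }')

echo "$default_records"     # 11
echo "$paragraph_records"   # 1: 2
                            # 2: 3
```

With the default RS the 11 input lines give 11 records; with RS="" there are only two records, one per block.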
Each block in the original data has six lines, so we can format it with the following command:
awk 'BEGIN{FS="\n";RS="";OFS="|"}{print $1,$2,$3,$4,$5,$6}' eee.txt
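One detail of the command above is worth a small illustration: why FS="\n" is set at all. With RS="" and the default FS, awk splits fields on spaces as well as newlines, so a line such as the timestamp "2018-06-03 22:14:21" would be cut into two fields. The sketch below uses a made-up two-line block and a hypothetical `block` helper to compare the two settings.

```shell
# Hypothetical one-block sample; the second line contains a space.
block() { printf 'id-001\n2018-06-03 22:14:21\n'; }

# Default FS in paragraph mode: spaces and newlines both split fields.
with_default_fs=$(block | awk 'BEGIN { RS = ""; OFS = "|" } { print NF, $2 }')

# FS="\n": only newlines split fields, so each input line is one field.
with_newline_fs=$(block | awk 'BEGIN { FS = "\n"; RS = ""; OFS = "|" } { print NF, $2 }')

echo "$with_default_fs"   # 3|2018-06-03
echo "$with_newline_fs"   # 2|2018-06-03 22:14:21
```

Only the second form keeps the date and time together as a single field, which is what the six-column output needs.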
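For the text above, this command does what we want, but it is not a general solution because it hard-codes six fields. Consider the following text: cat fff.txt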
huanxgin
XIAN
711711
CC
HANGZHOU
399229
33
MM
chianzhonggua dddo
fdfdsf
Shanghai
888912
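In this text, the number of lines in each "block" varies. To handle this kind of input, we use a loop to walk through the fields of each block and print them: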
awk 'BEGIN{FS="\n";RS="";ORS="|"} { for(x=1;x<NF;x++) { print $x "\t"} print $NF "\n"}' fff.txt |sed 's/^|//g'
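With the condition x<NF, the loop prints the fields up to $(NF-1); $NF is then printed last, followed by a newline. Because ORS="|", every print ends with "|" instead of a newline, so the explicit "\n" after $NF leaves a stray "|" at the start of the next line, which sed 's/^|//g' removes. awk then moves on to the next record delimited by RS="".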
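As an alternative to the loop, here is a sketch of a different and shorter idiom, not the command used above: set OFS="|" and assign $1 to itself, which makes awk rebuild the record with OFS between the fields. The input and the `blocks` helper are hypothetical. Unlike the loop version, the output has no tab before each "|" and needs no sed clean-up.

```shell
# Hypothetical input: blocks of different lengths separated by a blank line.
blocks() { printf 'a\nb c\n\nd\ne\nf\ng\n'; }

# FS="\n" makes each line a field; "$1 = $1" forces awk to rebuild $0
# using OFS, so each record is printed as a single "|"-joined line.
joined=$(blocks | awk 'BEGIN { FS = "\n"; RS = ""; OFS = "|" } { $1 = $1; print }')

echo "$joined"
# a|b c
# d|e|f|g
```

This works for any number of lines per block, since the number of fields is never written into the command.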