regex – sed – Removing quotes inside quotes in a large CSV file

运维开发网 https://www.qedev.com 2020-04-09 11:53 Source: web
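I am using the stream editor sed to convert a large amount of text file data (400MB) into CSV format.

I am very close to done, but the outstanding problem is quotes inside quotes, for data like this: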

1,word1,"description for word1","another text",""text contains "double quotes" some more text"
2,word2,"description for word2","another text","text may not contain double quotes, but may contain commas ,"
3,word3,"description for "word3"","another text","more text and more"

The desired output is:

1,word1,"description for word1","another text","text contains double quotes some more text"
2,word2,"description for word2","another text","text may not contain double quotes, but may contain commas ,"
3,word3,"description for word3","another text","more text and more"

I have been looking for help but have not come close to a solution. I tried the following sed commands with these regex patterns:

sed -i 's/(?<!^\s*|,)""(?!,""|\s*$)//g' *.txt
sed -i 's/(?<=[^,])"(?=[^,])//g' *.txt

These come from the following questions, but they do not seem to work with sed:

Related question for perl

Related question for SISS

The original files are *.txt, and I am trying to edit them with sed.
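A likely reason the two patterns fail is that they rely on lookbehind and lookahead, which the POSIX basic and extended regular expressions used by sed do not provide. In an engine that does support lookarounds, the second pattern gives the desired output on the three sample rows. The first pattern uses a variable-width lookbehind, which Python's re module also rejects, so it is not used here. Below is a minimal Python sketch under some assumptions: each record sits on one line, stray quotes never sit directly next to a comma, and the files are UTF-8. The names strip_inner_quotes, clean_lines and clean_file are made up for illustration.

```python
import re

# A quote counts as "inner" when a character exists on both sides of it
# and neither of those characters is a comma (the second pattern above).
INNER_QUOTE = re.compile(r'(?<=[^,])"(?=[^,])')


def strip_inner_quotes(line):
    """Remove inner quotes from one record that has no trailing newline."""
    return INNER_QUOTE.sub("", line)


def clean_lines(lines):
    """Lazily clean an iterable of lines, keeping each line ending as-is.

    The line ending is split off first: otherwise the newline would count
    as "a character that is not a comma" and the closing quote of the last
    field would be removed too.
    """
    for raw in lines:
        body = raw.rstrip("\r\n")
        ending = raw[len(body):]
        yield strip_inner_quotes(body) + ending


def clean_file(src_path, dst_path):
    """Stream src_path to dst_path so a 400MB file is never held in memory.

    Assumption: UTF-8 text. newline="" keeps the original line endings.
    """
    with open(src_path, encoding="utf-8", newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.writelines(clean_lines(src))
```

Because the rule only looks at the neighbours of each quote, a stray quote that touches a comma, as in "a ",b", is left in place. Check the data for such cases before relying on this.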

Here is one way to do it using GNU awk and the FPAT variable:

gawk 'BEGIN { FPAT="([^,]+)|(\"[^\"]+\")"; OFS=","; N="\"" } { for (i=1;i<=NF;i++) if ($i ~ /^\".*\"$/) { gsub(/\"/,"", $i); $i=N $i N } }1' file

Result:

1,word1,"description for word1","another text","text contains double quotes some more text"
2,word2,"description for word2","another text","text may not contain double quotes, but may contain commas ,"
3,word3,"description for word3","another text","more text and more"

Explanation:

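Using FPAT, a field is defined by its content rather than by its separator: it is either “one or more characters that are not a comma,” or “a double quote, one or more characters that are not a double quote, and a closing double quote”. awk takes the longest match at each position, so a field with stray inner quotes but no commas is still matched whole by the first alternative. On every line of input, the program loops through the fields, and if a field starts and ends with a double quote, it removes all quotes from that field and then wraps the field in a fresh pair of double quotes. The trailing 1 prints the resulting record.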
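For readers without gawk, the same field-by-content idea can be sketched in Python. This is an approximation of the one-liner above, not a drop-in replacement. Python's re module takes the first alternative that matches rather than the longest, so the sketch tries both patterns at each position and keeps the longer match to imitate awk. The names split_fields and requote are illustrative. As with the FPAT pattern, the + quantifiers mean an empty field between two commas is not captured.

```python
import re

QUOTED = re.compile(r'"[^"]+"')  # a quote, one or more non-quotes, a closing quote
UNQUOTED = re.compile(r'[^,]+')  # one or more characters that are not a comma


def split_fields(line):
    """Split a record into fields by content, longest match first."""
    fields = []
    pos = 0
    while pos < len(line):
        matches = [m for m in (QUOTED.match(line, pos),
                               UNQUOTED.match(line, pos)) if m]
        if not matches:
            pos += 1  # neither pattern matches: this is a separator comma
            continue
        longest = max(matches, key=lambda m: m.end())
        fields.append(longest.group())
        pos = longest.end()
    return fields


def requote(line):
    """Strip every quote from quoted fields, then wrap them in one pair."""
    fields = split_fields(line)
    touched = False
    for i, field in enumerate(fields):
        if len(field) >= 2 and field.startswith('"') and field.endswith('"'):
            fields[i] = '"' + field.replace('"', "") + '"'
            touched = True
    # Like awk, only rebuild the record when some field was reassigned.
    return ",".join(fields) if touched else line
```

On the third sample row, split_fields keeps "description for "word3"" together as one field because the comma-free alternative gives the longer match. A field that contains both stray quotes and commas would still be split in the wrong place, in this sketch and under the FPAT pattern alike.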
