Posted 2024-11-23 21:37
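I downloaded several tab-delimited text files from the FFIEC website (https://cdr.ffiec.gov/public/PWS/DownloadBulkData.aspx), under Call Reports - Single Period, Schedule RIE. One or more fields in these files contain LF (line feed) characters without a CR (carriage return). The files are being loaded into SQL Server 2022 (or used in Excel). Every record (row) in a file ends with a CRLF sequence. The problem is that when a file is read (in Excel, or imported into SQL Server with SSIS), an LF inside a field is interpreted as the start of the next record.
I know that \r\n on Windows differs from \n on UNIX/Linux, and I suspect Python is handling them as a sequence. I have not tried the Latin-1 or cp1252 encodings.
I am running Windows 11 Pro. The script is called from a shell command (from a SQL stored procedure or Excel VBA) and is part of a larger set of scripts that clean up files for import.
My attempted solution reads the file, walks through it one character at a time, finds each LF '\n' that is not preceded by a CR '\r', and replaces it with a semicolon ';'.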
    import sys

    def stripLFwoCR_file(file_path):
        # Read the entire file contents
        with open(file_path, 'r', encoding='utf-8') as file:
            input_data = file.read()

        # Initialize output
        output_data = []

        # Iterate input content 1 character at a time
        # Replace line feed characters '\n' not preceded by carriage return characters '\r' with ';'
        i = 0
        while i < len(input_data):
            if input_data[i] == '\n':
                # If previous character is not '\r' then replace '\n' with ';'
                if i == 0 or input_data[i-1] != '\r':
                    # Skip this '\n'
                    output_data.append(';')  # assumed: this line is missing from the post
                else:
                    output_data.append(input_data[i])  # assumed
            else:
                output_data.append(input_data[i])  # assumed
            i += 1

        # Write the modified content back to the file, overwriting it
        with open(file_path, 'w', encoding='utf-8') as file:
            file.write(''.join(output_data))  # assumed from the comment above

    if __name__ == "__main__":
        args = sys.argv
        # args[0] = current file
        # args[1] = function name
        # args[2:] = function args : (*unpacked)
        globals()[args[1]](*args[2:])  # assumed from the comments above
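The problem is that the script replaces every LF and every CRLF in the file with ';'.
(Screenshot: sample of the original document showing LF without CR. Lines 10-14 belong to a single record; lines 16-21 are one record.)
Update: I need to read the manual! Since 3.x, Python has had an option to leave newlines untranslated or to substitute a different newline automatically. My original code also had a logic error in the while loop.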
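A minimal sketch of the newline behaviour mentioned in the update, assuming a throwaway temp file whose name and contents are made up for illustration. With the default `newline=None`, text mode translates `\r\n` to `\n` on read, so a character-by-character loop never sees a `\r`; with `newline=''` the line endings come back untranslated.

```python
import os
import tempfile

# Hypothetical sample: a header record, then one record with a bare LF inside a field.
# Both records end in CRLF.
raw = b"id\tnote\r\n1\tfirst part\nsecond part\r\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    with open(path, "wb") as f:
        f.write(raw)

    # Default text mode (newline=None): '\r\n' is translated to '\n' on read.
    with open(path, "r", encoding="utf-8") as f:
        translated = f.read()

    # newline='': line endings are returned exactly as stored in the file.
    with open(path, "r", encoding="utf-8", newline="") as f:
        untranslated = f.read()

print(repr(translated))    # contains no '\r' at all
print(repr(untranslated))  # still contains both CRLF pairs
```

This would explain why every line ending in the file, CRLF included, ended up as ';' in the first attempt.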
I ended up using the version below because it required fewer rewrites to the rest of my code. I did test the answer from @JRiggles and marked it as the solution (cleaner, less code):
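    import sys

    def stripLFwoCR_file(file_path):
        # Read the entire file contents; newline='\r\n' leaves line endings untranslated
        with open(file_path, 'r', encoding='utf-8', newline='\r\n') as file:
            input_data = file.read()

        # Initialize output
        output_data = []

        # Iterate input content 1 character at a time
        # Replace line feed characters '\n' not preceded by carriage return characters '\r' with ';'
        i = 0
        while i < len(input_data):
            if input_data[i] == '\n':
                # If previous character is not '\r' then replace '\n' with ';'
                if i == 0 or input_data[i-1] != '\r':
                    # Skip this '\n' and replace
                    output_data.append(';')  # assumed: this line is missing from the post
                else:
                    output_data.append(input_data[i])  # assumed
            else:
                output_data.append(input_data[i])  # assumed
            i += 1

        # Write the modified content back to the file, overwriting it
        # newline='\n' writes '\n' as-is, so the existing '\r\n' pairs are preserved
        with open(file_path, 'w', encoding='utf-8', newline='\n') as file:
            file.write(''.join(output_data))  # assumed from the comment above

    if __name__ == "__main__":
        args = sys.argv
        # args[0] = current file
        # args[1] = function name
        # args[2:] = function args : (*unpacked)
        globals()[args[1]](*args[2:])  # assumed from the comments above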
This sounds like a job for re.sub. The pattern (?<!\r)\n will match any LF characters \n which aren't preceded by a carriage return (CR) \r.
Here's a sample file, sample data.txt (screenshot showing line endings).
To avoid any line ending conversions, open the file in binary read mode 'rb':
    import re

    pattern = b'(?<!\r)\n'  # match any \n not preceded by \r

    with open(r'<path to>\sample data.txt', 'rb') as file:
        data = file.read()

    print('Pre-substitution: ', data)
    # replace any matches with a semicolon ';'
    result = re.sub(pattern, b';', data)
    print('Post-substitution: ', result)
This prints:
Pre-substitution: b'this line ends with CRLF\r\nthis line ends with LF\nthis line ends with CRLF\r\nthis line ends with LF\nthis line ends with CRLF\r\n'
Post-substitution: b'this line ends with CRLF\r\nthis line ends with LF;this line ends with CRLF\r\nthis line ends with LF;this line ends with CRLF\r\n'
It's worth mentioning that consecutive \n characters will all be substituted, so \n\n\n becomes ;;; and \r\n\n becomes \r\n;
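The edge cases with consecutive line feeds can be checked directly. This is a small sketch; `replace_bare_lf` is a hypothetical helper name, not something from the question.

```python
import re

# A bare LF: '\n' that is not immediately preceded by '\r'.
pattern = rb'(?<!\r)\n'

def replace_bare_lf(data: bytes) -> bytes:
    """Hypothetical helper: replace every bare LF in data with b';'."""
    return re.sub(pattern, b';', data)

triple_lf = replace_bare_lf(b'a\n\n\nb')      # three bare LFs in a row
crlf_then_lf = replace_bare_lf(b'a\r\n\nb')   # a CRLF followed by a bare LF

print(triple_lf)
print(crlf_then_lf)
```

The lookbehind only inspects the single byte before each `\n`, so the LF that follows a CRLF still counts as bare and is replaced.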
Note also that the pattern string and substitution value are both bytestrings (b'<str>') - if you don't do this, you'll get a TypeError.
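The snippet in this answer only prints the result, while the question needs the file rewritten in place. A minimal sketch of that step follows; the function name `strip_bare_lf_file`, its `replacement` parameter, and the temp-file demo are assumptions for illustration. Working on bytes also sidesteps the encoding question for ASCII-compatible encodings such as UTF-8, Latin-1 or cp1252, since none of them changes the CR, LF, or ';' bytes.

```python
import os
import re
import tempfile

BARE_LF = re.compile(rb'(?<!\r)\n')  # '\n' not preceded by '\r'

def strip_bare_lf_file(file_path: str, replacement: bytes = b';') -> int:
    """Replace bare LFs in the file in place; return the number of replacements."""
    with open(file_path, 'rb') as f:      # binary mode: no newline translation
        data = f.read()
    cleaned, count = BARE_LF.subn(replacement, data)
    with open(file_path, 'wb') as f:      # write the bytes back unchanged otherwise
        f.write(cleaned)
    return count

# Demo on a made-up two-record file; the first record has a bare LF inside a field.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'sample data.txt')
    with open(path, 'wb') as f:
        f.write(b'a\tb\nc\r\nd\te\r\n')
    replaced = strip_bare_lf_file(path)
    with open(path, 'rb') as f:
        result = f.read()

print(replaced, result)
```

Returning the replacement count is a design choice here: a calling stored procedure or VBA macro could log it to see which files actually needed cleaning.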