Python script to read an LF without a CR in a text file and replace it with another character


Posted 2024-11-23 21:37

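I obtained several tab-delimited text files from the FFIEC website (https://cdr.ffiec.gov/public/PWS/DownloadBulkData.aspx) <Call Reports - Single Period, Schedule RIE> that contain LF (line feed) characters without a CR (carriage return) in one or more fields. The files are being loaded into SQL Server 2022 (or used in Excel). Every record (row) in a file ends with a CRLF sequence. The problem is that when the text file is read (in Excel, or when importing into SQL Server with SSIS), an LF inside a field is interpreted as the start of the next record.

I know that \r\n on Windows differs from \n on UNIX/Linux, and I suspect Python is handling them as a sequence. I have not tried the Latin-1 or cp1252 encodings.

I am running Windows 11 Pro. The script is called from a shell command (from a SQL stored procedure or Excel VBA) and is part of a larger set of scripts that clean files for import.

My attempted solution reads the file, walks through it one character at a time, finds each LF '\n' that is not preceded by a CR '\r', and replaces it with a semicolon ';'.

Python code (v3.12):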

import sys

def stripLFwoCR_file(file_path):
    # Read the entire file contents
    with open(file_path, 'r', encoding='utf-8') as file:
        input_data = file.read()

    # Initialize output
    output_data = []

    # Iterate input content 1 character at a time
    # Replace line feed characters '\n' not preceded by carriage return characters '\r' with ';'
    i = 0
    while i < len(input_data):
        if input_data[i] == '\n':
            # If previous character is not '\r' then replace '\n' with ';'
            if i == 0 or input_data[i-1] != '\r':
                output_data.append(';')
            # Skip this '\n'
        else:
            output_data.append(input_data[i])
        i += 1

    # Write the modified content back to the file, overwriting it
    with open(file_path, 'w', encoding='utf-8') as file:
        file.write(''.join(output_data))

if __name__ == "__main__":
    args = sys.argv
    # args[0] = current file
    # args[1] = function name
    # args[2:] = function args : (*unpacked)
    globals()[args[1]](*args[2:])

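The problem is that the script replaces every LF and every CRLF in the file with ';'.

Example from the original document (LF without CR): lines 10-14 belong to the same record. Lines 16-21 are one record.

Update: I needed to read the manual! Since 3.x, Python has had an option (the newline parameter of open()) to skip newline translation or to replace line endings automatically with a different one. My original code also had a logic error in the while loop.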
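As a small illustration of why the first version replaced every line ending, the sketch below writes a throwaway file and reads it back twice: once with the default text mode, and once with newline='' so nothing is translated. The file name demo.txt and the sample bytes are made up for the demonstration; they are not taken from the FFIEC data.

```python
import os
import tempfile

# Hypothetical sample: one bare LF inside a field, CRLF at the end of each record
raw = b"1\tfirst half\nsecond half\r\n2\tplain\r\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.txt")
    with open(path, "wb") as f:
        f.write(raw)

    # Default text mode uses universal newlines: every '\r\n' arrives as '\n'
    with open(path, "r", encoding="utf-8") as f:
        translated = f.read()

    # newline='' disables translation, so '\r\n' and bare '\n' stay distinct
    with open(path, "r", encoding="utf-8", newline="") as f:
        untranslated = f.read()

print(repr(translated))
print(repr(untranslated))
```

In the translated string no '\r' is left, so a check for "previous character is not '\r'" is true for every '\n', including the ones that originally ended a record.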

I ended up using this because it required fewer rewrites to the rest of my code. I did test the answer from @JRiggles and marked it as the solution (cleaner, less code):

import sys

def stripLFwoCR_file(file_path):
    # Read the entire file contents
    with open(file_path, 'r', encoding='utf-8', newline='\r\n') as file:
        input_data = file.read()

    # Initialize output
    output_data = []

    # Iterate input content 1 character at a time
    # Replace line feed characters '\n' not preceded by carriage return characters '\r' with ';'
    i = 0
    while i < len(input_data):
        if input_data[i] == '\n':
            # If previous character is not '\r' then replace '\n' with ';'
            if i == 0 or input_data[i-1] != '\r':
                # Skip this '\n' and replace
                output_data.append(';')
            else:
                output_data.append(input_data[i])
        else:
            output_data.append(input_data[i])
        i += 1

    # Write the modified content back to the file, overwriting it
    with open(file_path, 'w', encoding='utf-8', newline='\n') as file:
        file.write(''.join(output_data))

if __name__ == "__main__":
    args = sys.argv
    # args[0] = current file
    # args[1] = function name
    # args[2:] = function args : (*unpacked)
    globals()[args[1]](*args[2:])

Solution

This sounds like a job for re.sub. The pattern (?<!\r)\n will match any LF characters \n which aren't preceded by a carriage return (CR) \r.

Here's a sample file, sample data.txt (screenshot showing line endings)

To avoid any line ending conversions, open the file in binary read mode 'rb'

import re


pattern = b'(?<!\r)\n'  # match any \n not preceded by \r

with open(r'<path to>\sample data.txt', 'rb') as file:
    data = file.read()
    print('Pre-substitution: ', data)
    # replace any matches with a semicolon ';'
    result = re.sub(pattern, b';', data)
    print('Post-substitution: ', result)

This prints:

Pre-substitution:  b'this line ends with CRLF\r\nthis line ends with LF\nthis line ends with CRLF\r\nthis line ends with LF\nthis line ends with CRLF\r\n'
Post-substitution:  b'this line ends with CRLF\r\nthis line ends with LF;this line ends with CRLF\r\nthis line ends with LF;this line ends with CRLF\r\n'

It's worth mentioning that consecutive \ns will all be substituted, so \n\n\n becomes ;;; and \r\n\n becomes \r\n;.
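A quick way to check these edge cases is to run the same pattern over a few short byte strings. The inputs below are invented for illustration; the last two (an LF at the very start, and a plain CRLF) are extra cases not mentioned above.

```python
import re

pattern = rb"(?<!\r)\n"  # same pattern as above, written as a raw bytes literal

# Hypothetical inputs chosen to exercise the lookbehind
cases = [b"\n\n\n", b"\r\n\n", b"\nstart", b"a\r\nb"]
results = [re.sub(pattern, b";", case) for case in cases]

for case, result in zip(cases, results):
    print(case, "->", result)
```

An LF at position 0 has nothing before it, so the negative lookbehind succeeds and it is replaced as well; a CRLF pair is left alone.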

Note also that the pattern string and substitution value are both bytestrings (b'<str>') - if you don't do this, you'll get a TypeError!
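The snippet above only prints the result. Since the question overwrites the file in place, one possible way to finish the job is sketched here. The function name replace_bare_lf, its replacement parameter, and the demo file are assumptions made for this example, not part of the original answer. It reads the whole file into memory, which is assumed to be acceptable for files of this size.

```python
import re
import tempfile
from pathlib import Path

# LF not preceded by CR; bytes pattern, because the file is handled as bytes
BARE_LF = re.compile(rb"(?<!\r)\n")


def replace_bare_lf(path, replacement=b";"):
    """Rewrite the file in place and return the number of bare LFs replaced."""
    file_path = Path(path)
    data = file_path.read_bytes()          # no newline translation in binary mode
    cleaned, count = BARE_LF.subn(replacement, data)
    if count:                              # leave the file untouched if nothing matched
        file_path.write_bytes(cleaned)
    return count


# Demo on a throwaway file with made-up contents
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.txt"
    demo.write_bytes(b"id\tnote\r\n1\tfirst\nsecond\r\n")
    replaced = replace_bare_lf(demo)
    after = demo.read_bytes()

print(replaced, after)
```

Working in bytes throughout also sidesteps the encoding question raised in the post, because the file is never decoded.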