Python script to find LF characters without a preceding CR in a text file and replace them with another character

Posted on 2024-11-23 21:37


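I downloaded several tab-delimited text files from the FFIEC website ( https://cdr.ffiec.gov/public/PWS/DownloadBulkData.aspx ) <Call Reports - Single Period, Schedule RIE>. One or more fields in these files contain LF (line feed) characters without a CR (carriage return). The files are being loaded into SQL Server 2022 (or used in Excel). Every record (row) in a file ends with a CRLF sequence. The problem is that when the text file is read (in Excel, or when imported into SQL Server with SSIS), an LF inside a field is interpreted as the start of the next record.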

I know that \r\n on Windows is different from \n on UNIX/Linux, and I suspect Python is treating them as a single sequence. I have not yet tried Latin-1 or cp1252 encoding.

I am running Windows 11 Pro. The script is called from a shell command (from a SQL stored procedure or Excel VBA) and is part of a larger set of scripts that clean up files for import.

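My attempted solution reads the file, walks through it one character at a time, finds each LF '\n' that is not preceded by a CR '\r', and replaces it with a semicolon ';'.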

Python code (v3.12):

import sys

def stripLFwoCR_file(file_path):
    # Read the entire file contents
    with open(file_path, 'r', encoding='utf-8') as file:
        input_data = file.read()

    # Initialize output
    output_data = []

    # Iterate input content 1 character at a time
    # Replace line feed characters '\n' not preceded by carriage return characters '\r' with ';'
    i = 0
    while i < len(input_data):
        if input_data[i] == '\n':
            # If previous character is not '\r' then replace '\n' with ';'
            if i == 0 or input_data[i-1] != '\r':
                output_data.append(';')
            # Skip this '\n'
        else:
            output_data.append(input_data[i])
        i += 1

    # Write the modified content back to the file, overwriting it
    with open(file_path, 'w', encoding='utf-8') as file:
        file.write(''.join(output_data))

if __name__ == "__main__":
    args = sys.argv
    # args[0] = current file
    # args[1] = function name
    # args[2:] = function args : (*unpacked)
    globals()[args[1]](*args[2:])

The problem is that the script replaces every LF and every CRLF in the file with ';'.
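A minimal sketch of why this can happen, using a hypothetical two-record sample rather than the real FFIEC data: when a file is opened in text mode with the default newline=None, Python translates '\r\n' to '\n' while reading, so the loop never sees a '\r' in front of any '\n'. Passing newline='' turns that translation off.

```python
import os
import tempfile

# Hypothetical sample: two CRLF-terminated records; the first has a bare LF in a field.
raw = b"a\tb\nc\r\nd\te\r\n"

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(raw)
    path = tmp.name

# Default text mode (newline=None): '\r\n' is translated to '\n' on read.
with open(path, "r", encoding="utf-8") as f:
    translated = f.read()

# newline='' disables translation, so '\r\n' survives the read.
with open(path, "r", encoding="utf-8", newline="") as f:
    untranslated = f.read()

os.remove(path)

print(repr(translated))    # 'a\tb\nc\nd\te\n'
print(repr(untranslated))  # 'a\tb\nc\r\nd\te\r\n'
```

With the translated string, every '\n' looks like a bare LF, which matches the symptom described above.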

Example from the original document (LF without CR): lines 10-14 belong to the same record, and lines 16-21 are one record.

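Update: I needed to read the manual! Since 3.x, Python has had an option to either leave newlines untranslated or automatically replace them with a different one. My original code also had a logic error in the while loop.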

I ended up using this because it required fewer rewrites to the rest of my code. I did test the answer from @JRiggles and marked it as the solution (cleaner, less code):

import sys

def stripLFwoCR_file(file_path):
    # Read the entire file contents
    with open(file_path, 'r', encoding='utf-8', newline='\r\n') as file:
        input_data = file.read()

    # Initialize output
    output_data = []

    # Iterate input content 1 character at a time
    # Replace line feed characters '\n' not preceded by carriage return characters '\r' with ';'
    i = 0
    while i < len(input_data):
        if input_data[i] == '\n':
            # If previous character is not '\r' then replace '\n' with ';'
            if i == 0 or input_data[i-1] != '\r':
                # Skip this '\n' and replace
                output_data.append(';')
            else:
                output_data.append(input_data[i])
        else:
            output_data.append(input_data[i])
        i += 1

    # Write the modified content back to the file, overwriting it
    with open(file_path, 'w', encoding='utf-8', newline='\n') as file:
        file.write(''.join(output_data))

if __name__ == "__main__":
    args = sys.argv
    # args[0] = current file
    # args[1] = function name
    # args[2:] = function args : (*unpacked)
    globals()[args[1]](*args[2:])
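For comparison, the character loop can be sketched without an explicit index. This is only an illustration, not the code from the post: the function name strip_bare_lf and the replacement parameter are hypothetical. It assumes the file fits in memory and relies on newline='' to keep line endings untranslated on both read and write.

```python
def strip_bare_lf(file_path, replacement=";"):
    """Replace every LF that is not part of a CRLF pair, rewriting the file in place."""
    # newline='' keeps '\r\n' and '\n' exactly as they are stored on disk.
    with open(file_path, "r", encoding="utf-8", newline="") as f:
        data = f.read()

    # After splitting on CRLF, any '\n' left inside a piece cannot be preceded by '\r'.
    pieces = data.split("\r\n")
    cleaned = "\r\n".join(piece.replace("\n", replacement) for piece in pieces)

    # newline='' on write prevents '\n' from being expanded to the platform line ending.
    with open(file_path, "w", encoding="utf-8", newline="") as f:
        f.write(cleaned)
    return cleaned
```

Splitting on the record terminator first means the real CRLF endings are never inspected character by character, so they cannot be replaced by mistake.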

Solution


This sounds like a job for re.sub. The pattern (?<!\r)\n will match any LF characters \n which aren't preceded by a carriage return (CR) \r.

Here's a sample file, sample data.txt (screenshot showing line endings)


To avoid any line ending conversions, open the file in binary read mode 'rb'

import re


pattern = b'(?<!\r)\n'  # match any \n not preceded by \r

with open(r'<path to>\sample data.txt', 'rb') as file:
    data = file.read()
    print('Pre-substitution: ', data)
    # replace any matches with a semicolon ';'
    result = re.sub(pattern, b';', data)
    print('Post-substitution: ', result)

This prints:

Pre-substitution:  b'this line ends with CRLF\r\nthis line ends with LF\nthis line ends with CRLF\r\nthis line ends with LF\nthis line ends with CRLF\r\n'
Post-substitution:  b'this line ends with CRLF\r\nthis line ends with LF;this line ends with CRLF\r\nthis line ends with LF;this line ends with CRLF\r\n'

It's worth mentioning that consecutive \ns will all be substituted, so \n\n\n becomes ;;; and \r\n\n becomes \r\n;.
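The consecutive-LF behavior can be checked directly with a few byte strings. The helper name sub_bare_lf is hypothetical, and the second pattern (with \n+) is a variant not taken from the answer: it collapses a run of bare LFs into a single semicolon, in case that is preferable.

```python
import re

BARE_LF = re.compile(rb"(?<!\r)\n")       # same lookbehind as the answer, precompiled
BARE_LF_RUN = re.compile(rb"(?<!\r)\n+")  # variant: treat a run of bare LFs as one match

def sub_bare_lf(data: bytes) -> bytes:
    """Replace each bare LF with a semicolon."""
    return BARE_LF.sub(b";", data)

print(sub_bare_lf(b"\n\n\n"))               # b';;;'
print(sub_bare_lf(b"\r\n\n"))               # b'\r\n;'
print(BARE_LF_RUN.sub(b";", b"a\n\n\nb"))   # b'a;b'
```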

Note also that the pattern string and substitution value are both bytestrings (b'<str>') - if you don't do this, you'll get a TypeError!
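The snippet above only prints the result. To overwrite the file as the question does, a sketch along these lines may help. The function name strip_bare_lf_file and its replacement parameter are hypothetical, and it assumes the whole file fits in memory. The last few lines show the TypeError that mixing a str pattern with bytes data raises.

```python
import re

BARE_LF = re.compile(rb"(?<!\r)\n")  # LF not preceded by CR, as a bytes pattern

def strip_bare_lf_file(file_path, replacement=b";"):
    """Rewrite file_path in place and return how many bare LFs were replaced."""
    # Binary mode avoids any newline translation and any encoding guesswork.
    with open(file_path, "rb") as f:
        data = f.read()
    result, count = BARE_LF.subn(replacement, data)
    with open(file_path, "wb") as f:
        f.write(result)
    return count

# Mixing a str pattern with bytes data is what triggers the TypeError.
try:
    re.sub(r"(?<!\r)\n", ";", b"a\nb")
except TypeError as exc:
    print("TypeError:", exc)
```

Because the data stays as bytes from read to write, this version does not depend on the file being valid UTF-8, which may matter given the Latin-1 or cp1252 question raised above.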




Author: 黑洞官方问答小能手

Link: https://www.pythonheidong.com/blog/article/2045465/8d3551363f0627202685/

Source: python黑洞网
