
How to skip consecutive delimiters when preprocessing data

Posted on 2024-11-25 16:38


I'm trying to clean up an ASCII dataset with inconsistent spacing, for example

dataset =
[1 1  1        1  1    1 1  1
 1 1 1     1   1     1    4
     2    1    1  1   1  1 1 1]

but so far nothing I've tried has worked. The batch of code below does run, but because the delimiter is just ' ', it builds columns out of the single spaces. As a result, it prints a file's features as, for example,

features = [ nan nan nan nan 2
             2 nan nan nan 1 nan
             nan nan nan 3 nan]
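
To make the symptom concrete, here is a minimal sketch of the behaviour (the values are made up, and io.StringIO stands in for one of the .f16 files): with delimiter=' ', every single space counts as a separator, so runs of spaces turn into empty fields that pandas fills with NaN.

import io
import pandas as pd

# Made-up rows with uneven runs of spaces between values
raw = "1 1  1 1\n1  1 1 1\n"

# delimiter=' ' treats each space as its own separator, so every doubled
# space becomes an empty field and shows up as a NaN column
df = pd.read_csv(io.StringIO(raw), delimiter=' ', header=None, engine='python')
print(df)  # a NaN appears wherever a doubled space occurred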
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader, random_split
import numpy as np
import os

class SDataset(Dataset):
    def __init__(self, directory, delimiter=' '):
        self.data = []
        self.labels = []
        self.delimiter = delimiter

        # Loop through all files in the directory
        for filename in os.listdir(directory):
            if filename.endswith('.f16'):
                file_path = os.path.join(directory, filename)
                try:
                    # Load the ASCII data 
                    df = pd.read_csv(file_path, delimiter=self.delimiter, header=None, engine='python')
                    # Ensure there are enough columns
                    if df.shape[1] >= 21:
                        # Extract features and labels
                        features = df.iloc[:, [12, 13]].values
                        labels = df.iloc[:, 6].values  
                        self.data.append(features)
                        self.labels.append(labels)
                    else:
                        print(f"File {filename} does not have enough columns.")
                except Exception as e:
                    print(f"Error loading file {filename}: {e}")

        if not self.data:
            raise ValueError("No valid data found.")

        # Stack features
        self.data = np.vstack(self.data)  
        # Concatenate labels
        self.labels = np.concatenate(self.labels)  

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        features = torch.tensor(self.data[idx], dtype=torch.float32)
        label = torch.tensor(self.labels[idx], dtype=torch.float32)
        return features, label

# Load the full dataset from the directory containing .f16 files
dataset_directory = "C:/.../.../.../.../..."
dataset = SDataset(dataset_directory, delimiter=' ')

# Define the split sizes
train_size = int(0.7 * len(dataset))  # 70% for training
test_size = len(dataset) - train_size  # remaining 30% for testing

# Split the dataset
train_dataset, test_dataset = random_split(dataset, [train_size, test_size])

# Create DataLoaders for training and testing
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)

# Training loop
for batch_idx, (features, labels) in enumerate(train_loader):
    print(f"Training Batch {batch_idx+1}")
    print("Features:", features)
    print("Labels:", labels)

# Testing loop
for batch_idx, (features, labels) in enumerate(test_loader):
    print(f"Testing Batch {batch_idx+1}")
    print("Features:", features)
    print("Labels:", labels)

I tried adding .strip() to see whether it would fix the problem:

try:
    # Load the ASCII data 
    df = pd.read_csv(file_path, delimiter=self.delimiter, header=None, engine='python')
    # Ensure there are enough columns
    if df.shape[1] >= 21:
        # Extract features and labels
        df = df.strip()
        features = df.iloc[:, [12, 13]].values
        labels = df.iloc[:, 6].values  
        self.data.append(features)
        self.labels.append(labels)
    else:
        print(f"File {filename} does not have enough columns.")

I expected this to remove the whitespace inside the dataset, but I got this error instead:

Traceback (most recent call last):
  File "c:---.py", line 51, in <module>
    dataset = SDataset(dataset_directory, delimiter=' ')
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:---.py", line 34, in __init__
    raise ValueError("No valid data found. Ensure files contain the correct number of columns.")
ValueError: No valid data found. Ensure files contain the correct number of columns.
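
(For what it's worth, the ValueError here is probably indirect: a DataFrame has no .strip() method, so that line most likely raises an AttributeError, the except block above swallows it, nothing gets appended to self.data, and the "No valid data found" check then fires. A minimal sketch of the underlying failure:)

import pandas as pd

df = pd.DataFrame([[1, 2, 3]])
df.strip()  # AttributeError: 'DataFrame' object has no attribute 'strip'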

Solution


You can set sep=r'\s+' in read_csv so that pandas knows the columns are separated by one or more whitespace characters. For example, let file.csv have the content

A  B  C
1 2   3
4   5 6

Then

import pandas as pd
df = pd.read_csv('file.csv', sep=r'\s+')
print(df)

gives the output

   A  B  C
0  1  2  3
1  4  5  6

Note that you can also supply a different regular-expression pattern as the value of sep.
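
For instance, if some files also happened to mix commas in with the spaces (a hypothetical variation, not part of the question above), a pattern such as sep=r'[,\s]+' would treat any run of commas and/or whitespace as a single separator:

import io
import pandas as pd

raw = "1, 2   3\n4 ,5  6\n"
df = pd.read_csv(io.StringIO(raw), sep=r'[,\s]+', header=None, engine='python')
print(df)
#    0  1  2
# 0  1  2  3
# 1  4  5  6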

(Tested with pandas 2.0.1)
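
Applied to the loader in the question, the fix would simply be to swap the single-space delimiter for the whitespace pattern in the read_csv call (a sketch, assuming the rest of SDataset stays as it is):

# Inside SDataset.__init__: one or more whitespace characters act as a
# single separator, so consecutive spaces no longer create NaN columns
df = pd.read_csv(file_path, sep=r'\s+', header=None, engine='python')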


