Posted 2024-11-25 16:38
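I am trying to clean an ASCII dataset whose columns are separated by an inconsistent number of spaces, for example

dataset =
[1 1 1 1 1 1 1 1
1 1 1 1 1 1 4
2 1 1 1 1 1 1 1]

but nothing I have tried so far has worked. The code below runs and produces batches, but because the delimiter is just a single space ' ', every extra space is turned into a column of its own. As a result, the printed features look like this:

features = [ nan nan nan nan 2
2 nan nan nan 1 nan
nan nan nan 3 nan]
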
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader, random_split
import numpy as np
import os

class SDataset(Dataset):
    def __init__(self, directory, delimiter=' '):
        self.data = []
        self.labels = []
        self.delimiter = delimiter

        # Loop through all files in the directory
        for filename in os.listdir(directory):
            if filename.endswith('.f16'):
                file_path = os.path.join(directory, filename)
                try:
                    # Load the ASCII data
                    df = pd.read_csv(file_path, delimiter=self.delimiter, header=None, engine='python')

                    # Ensure there are enough columns
                    if df.shape[1] >= 21:
                        # Extract features and labels
                        features = df.iloc[:, [12, 13]].values
                        labels = df.iloc[:, 6].values
                        self.data.append(features)
                        self.labels.append(labels)
                    else:
                        print(f"File {filename} does not have enough columns.")
                except Exception as e:
                    print(f"Error loading file {filename}: {e}")

        if not self.data:
            raise ValueError("No valid data found.")

        # Stack features
        self.data = np.vstack(self.data)
        # Concatenate labels
        self.labels = np.concatenate(self.labels)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        features = torch.tensor(self.data[idx], dtype=torch.float32)
        label = torch.tensor(self.labels[idx], dtype=torch.float32)
        return features, label

# Load the full dataset from the directory containing .f16 files
dataset_directory = "C:/.../.../.../.../..."
dataset = SDataset(dataset_directory, delimiter=' ')

# Define the split sizes
train_size = int(0.7 * len(dataset))   # 70% for training
test_size = len(dataset) - train_size  # remaining 30% for testing

# Split the dataset
train_dataset, test_dataset = random_split(dataset, [train_size, test_size])

# Create DataLoaders for training and testing
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)

# Training loop
for batch_idx, (features, labels) in enumerate(train_loader):
    print(f"Training Batch {batch_idx+1}")
    print("Features:", features)
    print("Labels:", labels)

# Testing loop
for batch_idx, (features, labels) in enumerate(test_loader):
    print(f"Testing Batch {batch_idx+1}")
    print("Features:", features)
    print("Labels:", labels)

I tried adding .strip() to see whether that would fix the problem:

                try:
                    # Load the ASCII data
                    df = pd.read_csv(file_path, delimiter=self.delimiter, header=None, engine='python')

                    # Ensure there are enough columns
                    if df.shape[1] >= 21:
                        # Extract features and labels
                        df = df.strip()
                        features = df.iloc[:, [12, 13]].values
                        labels = df.iloc[:, 6].values
                        self.data.append(features)
                        self.labels.append(labels)
                    else:
                        print(f"File {filename} does not have enough columns.")

I expected this to remove the spaces inside the dataset, but instead I got this error:

Traceback (most recent call last):
File "c:---.py", line 51, in <module>
dataset = SDataset(dataset_directory, delimiter=' ')
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "c:---.py", line 34, in __init__
raise ValueError("No valid data found. Ensure files contain the correct number of columns.")
ValueError: No valid data found. Ensure files contain the correct number of columns.
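
A possible reading of this traceback, offered as a sketch rather than a diagnosis of the actual files: a pandas DataFrame has no .strip() method, so df.strip() raises AttributeError. In the full class that error is caught by the except Exception handler, nothing is appended to self.data, and the ValueError follows. The small frame and the names below are made up for illustration. Stripping strings in a frame goes through the .str accessor column by column, although that would not undo columns that were already split on single spaces at parse time.

```python
import pandas as pd

# Illustrative frame with padded strings (not the asker's data)
df = pd.DataFrame({"a": [" 1 ", "2 "], "b": ["x ", " y"]})

# DataFrame itself has no .strip(); attribute lookup fails with AttributeError
try:
    df.strip()
    error_name = None
except AttributeError as exc:
    error_name = type(exc).__name__

# String columns can be stripped one column at a time via the .str accessor
stripped = df.apply(lambda col: col.str.strip())

print(error_name)
print(stripped)
```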
You can set sep=r'\s+' to tell pandas read_csv that columns are separated by one or more whitespace characters. For example, let file.csv contain
A B C
1 2 3
4 5 6
then
import pandas as pd
df = pd.read_csv('file.csv', sep=r'\s+')
print(df)
gives the output
   A  B  C
0  1  2  3
1  4  5  6
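
To connect this with the nan columns from the question, the sketch below parses the same in-memory text twice: once with a single space as the separator and once with r'\s+'. The sample text and variable names are invented for illustration; only the read_csv parameters come from the discussion above.

```python
import io
import pandas as pd

# Two rows of three values each, separated by uneven runs of spaces
text = "1  2 3\n4 5  6\n"

# A single-space separator treats every extra space as an empty field
with_single_space = pd.read_csv(io.StringIO(text), sep=" ", header=None)

# A whitespace regex collapses each run of spaces into one separator
with_regex = pd.read_csv(io.StringIO(text), sep=r"\s+", header=None)

print(with_single_space)  # four columns, some cells are NaN
print(with_regex)         # three clean integer columns
```

The empty fields created by the single-space separator are read as missing values, which is where the nan entries come from.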
Note that you can also pass a different regular-expression pattern as sep.
(Tested with pandas 2.0.1.)
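
Applied to the loader from the question, the idea might look like the following minimal sketch. The helper load_features_and_labels, its parameters, and the synthetic sample are hypothetical and not part of the original code; the column positions (12, 13 for features, 6 for labels) and the 21-column check are taken from the question. The sep argument is left configurable so another pattern can be passed if the files need it.

```python
import io
import numpy as np
import pandas as pd

def load_features_and_labels(source, sep=r"\s+", min_columns=21):
    """Parse one whitespace-delimited file; return (features, labels) or None."""
    df = pd.read_csv(source, sep=sep, header=None)
    if df.shape[1] < min_columns:
        return None
    # Same column positions as in the question's SDataset
    features = df.iloc[:, [12, 13]].to_numpy(dtype=np.float32)
    labels = df.iloc[:, 6].to_numpy(dtype=np.float32)
    return features, labels

def uneven_join(values):
    """Join values with runs of 1 to 3 spaces to imitate irregular spacing."""
    gaps = [" ", "   ", "  "]
    return "".join(f"{v}{gaps[i % len(gaps)]}" for i, v in enumerate(values)).rstrip()

# Synthetic stand-in for a .f16 file: two rows of 21 columns
sample = "\n".join([uneven_join(range(21)), uneven_join(range(100, 121))]) + "\n"

features, labels = load_features_and_labels(io.StringIO(sample))
print(features)
print(labels)
```

Inside SDataset, the equivalent change would be to pass delimiter=r'\s+' instead of delimiter=' ' when constructing the dataset.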
Author: 黑洞官方问答小能手
Link: https://www.pythonheidong.com/blog/article/2045887/c2c7e49f181ae03c2ac2/
Source: python黑洞网