Posted 2024-12-08 09:57
I have a project I'm working on for a data analysis course, where we pick a data set and go through the steps of cleaning and exploring the data with a question to answer in mind.
I want to see how many records fall in each year, but right now the Year column in the data set has dtype object. Its values range from whole years like 1998, to just the last 2 digits like 87, to approximate or ambiguous entries ('early 1990's', '89 or 90', '2011- 2012', 'approx 2001').
What is the best way to convert all of these variants to a consistent format? Or would it be better to drop the values that are not definitive? I worry that dropping them would lose too much data, because the dataset is already pretty small (about 5000 rows total).
I have looked into regex and it seems like that is the path I should go down to keep and alter the values, but I still don't understand it conceptually very well, and I worry about the efficiency of filtering for so many different value variations.
I'm still very new to Python and pandas.
Assuming the values in your Year column are strings, I would write a normalize function like this:
import re
import pandas as pd

data = [
    {"year": "early 1990's"},
    {"year": "89 or 90"},
    {"year": "2011-2012"},
    {"year": "approx 2001"},
]

def normalize(row):
    year = row["year"]
    # Count the number of digits
    count = len(re.findall(r"\d", year))
    if count == 4:
        # match YYYY
        if m := re.search(r"\d\d\d\d", year):
            return m.group(0)
    if count == 2:
        # match YY
        if m := re.search(r"\d\d", year):
            return "19" + m.group(0)
    # Anything else falls through and returns None

df = pd.DataFrame(data)
df["normalized"] = df.apply(normalize, axis=1)
print(df)
=>
           year normalized
0  early 1990's       1990
1      89 or 90       None
2     2011-2012       None
3   approx 2001       2001
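As an aside on the efficiency worry: the four-digit case can probably be handled without a row-wise apply, using pandas' vectorized `Series.str.extract`. This is only a sketch on the same four sample strings, and the variable names are made up for illustration. Note that it behaves differently from normalize for '2011-2012': it keeps the first four-digit run instead of returning None.

```python
import pandas as pd

# Same sample strings as in the answer above.
years = pd.Series(["early 1990's", "89 or 90", "2011-2012", "approx 2001"])

# Extract the first run of four digits; rows without one become NaN.
four_digit = years.str.extract(r"(\d{4})", expand=False)
print(four_digit)
```

For a data set of about 5000 rows, either approach should run quickly, so readability may matter more than speed here.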
The normalize function returns None for any value that matches no pattern. You can list those rows as follows:
>>> print(df[df["normalized"].isnull()])
...
        year normalized
1   89 or 90       None
2  2011-2012       None
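With a few thousand rows the list of unmatched values can get long. One possible way to decide which pattern to handle next is to count how often each unmatched string occurs. The sketch below uses a tiny made-up frame (the duplicated '89 or 90' row is purely illustrative) rather than the real data.

```python
import pandas as pd

# Toy stand-in for the frame produced above; values are illustrative only.
df = pd.DataFrame({
    "year": ["89 or 90", "2011-2012", "89 or 90", "approx 2001"],
    "normalized": [None, None, None, "2001"],
})

# Keep only the raw strings that were not normalized, then count duplicates.
unmatched = df.loc[df["normalized"].isna(), "year"]
pattern_counts = unmatched.value_counts()
print(pattern_counts)
```

The most frequent entries are the ones worth writing a rule for first; rare one-off values can be fixed by hand or dropped.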
Review the output and modify the normalize function to cover the remaining patterns. Repeat these steps until you are satisfied with the result.
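For illustration, here is one way a later iteration might look. It is a sketch under assumptions that are mine, not part of the data set: the hypothetical `normalize_year` helper keeps the first year of a range or an "or" pair, only treats four-digit numbers starting with 19 or 20 as years, and maps two-digit values to the 1900s (as the original function does). Whether those choices are acceptable depends on the question you want to answer.

```python
import re
import pandas as pd

def normalize_year(text):
    """Return the first plausible year in text as an int, or None."""
    # Prefer a standalone four-digit year (assumed to start with 19 or 20).
    if m := re.search(r"(?<!\d)(?:19|20)\d{2}(?!\d)", text):
        return int(m.group(0))
    # Otherwise take the first standalone two-digit number, assumed 19YY.
    if m := re.search(r"(?<!\d)\d{2}(?!\d)", text):
        return 1900 + int(m.group(0))
    return None

df = pd.DataFrame(
    {"year": ["early 1990's", "89 or 90", "2011-2012", "approx 2001", "87", "1998"]}
)
df["normalized"] = df["year"].map(normalize_year)

# Number of rows per normalized year, ordered by year.
counts = df["normalized"].value_counts().sort_index()
print(counts)
```

Once every value normalizes to a single year, `value_counts` gives the per-year counts the question asks for. Keeping the original string column alongside the normalized one makes it easy to revisit these assumptions later.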
Author: 黑洞官方问答小能手
Link: https://www.pythonheidong.com/blog/article/2046418/7ca0f238088b951c087d/
Source: python黑洞网