How can I clean a year column with messy values?

Published 2024-12-08 09:57


I have a project I'm working on for a data analysis course, where we pick a data set and go through the steps of cleaning and exploring the data with a question to answer in mind.

I want to see how many records fall in each year, but the Year column in the data set currently has dtype object. Its values range from full years like 1998, to just the last two digits like 87, to approximate or ambiguous entries ('early 1990's', '89 or 90', '2011- 2012', 'approx 2001').

I'm trying to determine the best way to convert all these variants to a consistent format. Or would it be better to drop the values that are not definitive? I worry that dropping them would lose too much data, because the dataset is already pretty small (about 5000 rows total).

I have looked into regex, and it seems like the path I should go down to keep and alter the values, but I still don't understand it well conceptually, and I worry about the efficiency of filtering for so many different value variations.

I'm still very new to Python and pandas.


Solution


Assuming the values in your Year column are strings, I would write a normalize function like this:

import re
import pandas as pd

data = [
    {"year": "early 1990's"},
    {"year": "89 or 90"},
    {"year": "2011-2012"},
    {"year": "approx 2001"},
]

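def normalize(row):
    year = row["year"]

    # Count how many digit characters the value contains
    count = len(re.findall(r"\d", year))

    if count == 4:
        # Exactly four digits: accept them only if they form one YYYY run
        if m := re.search(r"\d{4}", year):
            return m.group(0)

    if count == 2:
        # Exactly two digits: treat a YY run as 19YY (assumes no 20YY short forms)
        if m := re.search(r"\d{2}", year):
            return "19" + m.group(0)

    # Anything else is left unmatched for manual review
    return None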

df = pd.DataFrame(data)
df["normalized"] = df.apply(normalize, axis=1)
print(df)

=>
           year normalized
0  early 1990's       1990
1      89 or 90       None
2     2011-2012       None
3   approx 2001       2001

The function returns None for values that match no pattern. You can list them as follows:

>>> print(df[df["normalized"].isnull()])
...
        year normalized
1   89 or 90       None
2  2011-2012       None
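
With about 5000 rows, the unmatched list may be long. As a minimal sketch, assuming a normalized column like the one above, counting the distinct unmatched raw values shows which patterns are worth handling first. The sample rows below are made up for illustration only.

```python
import pandas as pd

# Made-up sample: "normalized" is None wherever no pattern matched
df = pd.DataFrame({
    "year": ["89 or 90", "2011-2012", "89 or 90", "approx 2001"],
    "normalized": [None, None, None, "2001"],
})

# Keep only the raw values that are still unmatched
unmatched = df.loc[df["normalized"].isna(), "year"]

# Most frequent unmatched spellings first
counts = unmatched.value_counts()
print(counts)
```

A spelling that appears many times is probably worth its own rule, while a one-off may be easier to fix by hand or drop.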

Review the output and modify the normalize function as needed. Repeat these steps until you are satisfied.
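
One possible later iteration, offered as a sketch rather than a definitive rule set: find the first standalone two- or four-digit number, expand two-digit values with a cutoff, and keep only the first year when a value names several. The name `normalize_year`, the cutoff `PIVOT = 30`, and the keep-the-first-year policy are all assumptions made for illustration. Whether '89 or 90' should become 1989, 1990, or be dropped is a judgment call that depends on your question.

```python
import re

# Assumed cutoff: two-digit values above it are read as 19YY, the rest as 20YY
PIVOT = 30

# A standalone run of exactly four or exactly two digits
YEAR_TOKEN = re.compile(r"(?<!\d)(\d{4}|\d{2})(?!\d)")

def normalize_year(text):
    """Return the first year-like number in text as an int, or None."""
    m = YEAR_TOKEN.search(str(text))
    if m is None:
        return None
    token = m.group(1)
    if len(token) == 4:
        return int(token)
    yy = int(token)
    return 1900 + yy if yy > PIVOT else 2000 + yy

samples = ["early 1990's", "89 or 90", "2011- 2012", "approx 2001", "87", "unknown"]
for s in samples:
    print(repr(s), "->", normalize_year(s))
```

Assuming the column is named year as in the example above, `df["year"].map(normalize_year)` would apply this to every value. At roughly 5000 rows, calling a Python function once per value should not be a practical performance concern.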




Author: 黑洞官方问答小能手

Link: https://www.pythonheidong.com/blog/article/2046418/7ca0f238088b951c087d/

Source: python黑洞网
