How can I clean a year column with messy values?

Posted 2024-12-08 09:57

I have a project I'm working on for a data analysis course, where we pick a data set and go through the steps of cleaning and exploring the data with a question to answer in mind.

I want to see how many records fall in each year, but the Year column currently has dtype object, with values ranging from full years like 1998, to just the last two digits like 87, to loose descriptions or ranges of presumed years ('early 1990's', '89 or 90', '2011- 2012', 'approx 2001').

I'm trying to determine the best way to convert all of these variants to a proper format. Or would it be better to drop the values that are not definitive? I worry that dropping them would lose too much data, because the dataset is already fairly small (about 5000 rows in total).

I have looked into regex, and it seems like the right way to keep and alter the values, but I still don't understand it well conceptually, and I worry about the efficiency of filtering for so many different variations.

I'm still very new to Python and pandas.

Solution

Assuming the values in your Year column are strings, I would write a normalize function like this:

import re
import pandas as pd

data = [
    {"year": "early 1990's"},
    {"year": "89 or 90"},
    {"year": "2011-2012"},
    {"year": "approx 2001"},
]

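def normalize(row):
    year = row["year"]

    # Count the digits in the value to decide which pattern applies
    count = len(re.findall(r"\d", year))

    if count == 4:
        # Four digits in a row: treat as YYYY
        if m := re.search(r"\d{4}", year):
            return m.group(0)

    if count == 2:
        # Two digits: treat as YY and assume the 1900s
        if m := re.search(r"\d\d", year):
            return "19" + m.group(0)

    # Anything else is left unmatched for manual review
    return None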

df = pd.DataFrame(data)
df["normalized"] = df.apply(normalize, axis=1)
print(df)

=>
           year normalized
0  early 1990's       1990
1      89 or 90       None
2     2011-2012       None
3   approx 2001       2001

The function returns None for any value that matches neither pattern. You can list those rows as follows:

>>> print(df[df["normalized"].isnull()])
...
        year normalized
1   89 or 90       None
2  2011-2012       None
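One possible way to cover the two unmatched rows is sketched below. It is not part of the answer above: the name normalize_year, the century parameter, and the policy of keeping the first year found in a range or alternative are all assumptions made for illustration, so adjust them to whatever suits your analysis. Two-digit years are expanded with "19", as in the original function.

```python
import re
import pandas as pd

def normalize_year(text, century="19"):
    """Return the first year found in text as an int, or None.

    Hypothetical helper: the "first year wins" policy and the default
    century are assumptions, not something the data set dictates.
    """
    if pd.isna(text):
        return None
    # Runs of exactly 4 or exactly 2 digits that are not part of a longer number
    tokens = re.findall(r"(?<!\d)(\d{4}|\d{2})(?!\d)", str(text))
    if not tokens:
        return None
    first = tokens[0]
    if len(first) == 2:
        # Two-digit year: assume the given century (1900s by default)
        first = century + first
    return int(first)

df = pd.DataFrame(
    {"year": ["early 1990's", "89 or 90", "2011- 2012", "approx 2001", "unknown"]}
)
# Int64 is pandas' nullable integer dtype, so unmatched rows become <NA>
df["normalized"] = df["year"].map(normalize_year).astype("Int64")
print(df)
```

With about 5000 rows, applying a Python function per value like this is fast enough that efficiency should not be a concern.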

Review the output and adjust the normalize function as needed. Repeat these steps until you are satisfied with the result.
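Once the column is normalized, the per-year counts the question asks for can be computed roughly as follows. This is a minimal sketch that assumes the normalized column holds year strings or None, as produced by the function above; the sample values and the year_int column name are made up for illustration.

```python
import pandas as pd

# Hypothetical result of the normalize step: year strings or None
df = pd.DataFrame({"normalized": ["1990", None, None, "2001", "1990"]})

# Convert to a nullable integer dtype so missing values stay missing
df["year_int"] = pd.to_numeric(df["normalized"], errors="coerce").astype("Int64")

# Rows per year; missing values are excluded by default
counts = df["year_int"].value_counts().sort_index()
print(counts)

# Fraction of rows that could not be normalized, to judge the data loss
missing_share = df["year_int"].isna().mean()
print(missing_share)
```

Checking the missing fraction before dropping anything gives a concrete measure of how much data would be lost.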


Author: 黑洞官方问答小能手

Link: https://www.pythonheidong.com/blog/article/2046418/7ca0f238088b951c087d/

Source: python黑洞网
