Commonly Used Python Modules – Zhouzhou's Lecture Notes

Contents

1 patsy
2 PyMC3
3 pycassa
4 rpy2
5 PyCUDA
6 pyMPI and mpiutils
7 bs4
8 json and ujson
9 numpy
10 sqlite3
11 mysql
12 tarfile
13 os and shutil
14 argparse
15 pickle
16 h5py
17 msgpack

patsy

patsy is a Python library for describing statistical models and building Design Matrices using R-like formulas.

PyMC3

todo

pycassa

todo

rpy2

todo

PyCUDA

todo

pyMPI and mpiutils

todo

bs4

todo

json and ujson

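For the performance of common JSON libraries, see "Benchmark of Python JSON libraries" by Artem Krylysov. According to that article, the built-in json module is sufficient in most cases; if speed matters a lot, consider ujson.
Example using the built-in json module (see Python JSON):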

import json
data = [ { 'a' : 1, 'b' : 2, 'c' : 3, 'd' : 4, 'e' : 5 } ]
txt = json.dumps(data)      # dump
# print txt
obj = json.loads(txt)   # load
# print obj

Example using ujson (see the ujson homepage):

import ujson
ujson.dumps([{"key": "value"}, 81, True])
obj=ujson.loads('''[{"key": "value"}, 81, true]''')
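
The advice above (use the built-in json module by default, switch to ujson when speed matters) can be sketched as an optional import. This is a minimal sketch in Python 3 syntax, unlike the Python 2 snippets above. It assumes ujson may or may not be installed and only relies on the dumps/loads calls already shown; the alias fast_json is a name made up for this example.

```python
import json

try:
    import ujson as fast_json  # optional third-party speedup, if installed
except ImportError:
    fast_json = json           # fall back to the standard library

data = [{'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}]
txt = fast_json.dumps(data)    # serialize to a JSON string
obj = fast_json.loads(txt)     # parse the string back into Python objects
print(obj == data)
```

The rest of the program only refers to fast_json, so the choice of library stays in one place.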

numpy

Reference: Python Data Analysis, Second Edition, by Armando Fandango (armando-fandango/Python-Data-Analysis)

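import numpy as np
m = np.array([np.arange(2), np.arange(2)])
m.dtype         # data type of the elements
m.ndim          # number of dimensions
m.dtype.itemsize    # size of one element in bytes
m.shape         # size along each dimension
m.T             # transpose
m.astype(float) # convert the element type (watch for precision loss)
a = np.arange(9)
a[:7:2]         # from the start up to index 7 (exclusive) in steps of 2, i.e. 0, 2, 4, 6
a[::-1]         # reverse the order of the elements
b = np.arange(24).reshape(2,3,4)    # change the shape of the array
b.resize(2, 12)     # like reshape, but modifies the array in place
b.ravel()       # flatten to a 1-D array (returns a view when possible)
c = b.flatten() # flatten to a 1-D array, always returning a copy
b.transpose()   # transpose
x = np.arange(9).reshape(3, 3)  # two arrays with matching shapes for concatenation
y = 2 * x
np.concatenate((x, y), axis=1)  # horizontal stacking (adds columns)
np.concatenate((x, y), axis=0)  # vertical stacking (adds rows)
f = b.flat      # flat iterator over the array
b.flat[2]       # read or assign a single element through the iterator
b.flat[[1,3]]
b.flat = 0      # set every element to 0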

sqlite3

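SQLite is a SQL database stored in a local file. Python ships with the sqlite3 module, so it can be used without installing anything.
Example code adapted from Liao Xuefeng's official website, "Using SQLite":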

import sqlite3
conn = sqlite3.connect('test.db') # the file is created in the current directory if it does not exist
cursor = conn.cursor()
cursor.execute('create table user (id varchar(20) primary key, name varchar(20))')
cursor.execute("insert into user (id, name) values ('1', 'Michael')")
print cursor.rowcount # rowcount gives the number of rows inserted
cursor.close()
conn.commit()

cursor = conn.cursor()
cursor.execute('select * from user where id=?', ('1',))
values = cursor.fetchall() # fetch the result set
print values   # prints [(u'1', u'Michael')]
cursor.close()
conn.close()
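
The insert-then-query flow above can also be tried without leaving a file behind, using an in-memory database, placeholders for the inserted values, and executemany for several rows at once. This is a minimal sketch in Python 3 syntax; the extra rows ('Sarah', 'Adam') are made-up sample data.

```python
import sqlite3

conn = sqlite3.connect(':memory:')  # in-memory database, nothing is written to disk
cursor = conn.cursor()
cursor.execute('create table user (id varchar(20) primary key, name varchar(20))')

# '?' placeholders let sqlite3 handle the quoting of the values
rows = [('1', 'Michael'), ('2', 'Sarah'), ('3', 'Adam')]
cursor.executemany('insert into user (id, name) values (?, ?)', rows)
inserted = cursor.rowcount          # total number of rows inserted by executemany
conn.commit()

cursor.execute('select name from user where id=?', ('2',))
values = cursor.fetchall()          # list of result tuples
print(inserted, values)
cursor.close()
conn.close()
```

Placeholders are preferable to building the SQL string by hand, because values containing quotes are handled by the driver.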

The sqlite3 command-line program
The sqlite3 shell is the program sqlite3. Typical usage is shown below (type .help at the prompt for help):

$ sqlite3 cnfdir.db 
SQLite version 3.7.17 2013-05-20 00:56:22
Enter ".help" for instructions
Enter SQL statements terminated with a ";"
sqlite> .help
sqlite> .tables
user
sqlite> .schema user
CREATE TABLE user (id varchar(20) primary key, name varchar(20));
sqlite> select * from user;
1|Michael
sqlite> .quit

Converting a MySQL database to a SQLite database
For data with a fairly simple structure, the mysql2sqlite tool on GitHub can be used:

mysqldump --skip-extended-insert --compact [options]... DB_name > dump_mysql.sql
./mysql2sqlite dump_mysql.sql | sqlite3 mysqlite3.db

mysql

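MySQL is the most widely used database server on the web. SQLite is lightweight and embeddable but cannot sustain highly concurrent access, so it is mainly suited to desktop and mobile applications. MySQL is designed for the server side and handles highly concurrent access, at the cost of using far more memory than SQLite. MySQL also has several storage engines; the most commonly used is InnoDB, which supports transactions.
Two Python MySQL drivers are available:
– mysql-connector-python: the official pure-Python driver from MySQL
– MySQL-python: a Python driver that wraps the MySQL C driver

Example code adapted from "Python – MySQL Database Access":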

import MySQLdb
# open database connection
db = MySQLdb.connect("localhost","testuser","test123","TESTDB" )
# prepare a cursor object using cursor() method
cursor = db.cursor()
# execute SQL query using execute() method.
cursor.execute("SELECT VERSION()")
# Fetch a single row using fetchone() method.
data = cursor.fetchone()
print "Database version : %s " % data
cursor.close()
# disconnect from server
db.close()

tarfile

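Since there is a zipfile module for compression, an archiving module, tarfile, follows naturally. The tarfile module packs and unpacks tar archives, including archives compressed with gzip, bz2, or lzma.
For .zip files, use the zipfile module; for higher-level operations, use the shutil module.
Example code adapted from Liu Jiang's blog and tutorial, tarfile: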

import tarfile
tar = tarfile.open("sec3.lts.InetTongxing.20180704.0.tgz", "r:gz")
# list file name in tar ball
print tar.getnames()
for tarinfo in tar:
    print tarinfo.name, "is", tarinfo.size, "bytes in size and is",
    if tarinfo.isreg(): print "a regular file."
    elif tarinfo.isdir(): print "a directory."
    else: print "something else."
# extract content from tarball
f = tar.extractfile('ph600519.pkl')
print f.read()
# close the tar ball
tar.close()

# create tar ball
tar = tarfile.open("sec3.lts.InetTongxing.20180704.0.tgz", "w:gz")
for name in ["ph600518.pkl", "ph600519.pkl", "ph600887"]:
    tar.add(name)
tar.close()
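
The pack-then-unpack flow above can be tried end to end in a temporary directory. This is a minimal sketch in Python 3 syntax; the file names hello.txt and demo.tgz are made up for the example, and tarfile.open is used as a context manager so the archive is closed automatically.

```python
import os
import shutil
import tarfile
import tempfile

workdir = tempfile.mkdtemp()                   # scratch directory for the demo
src = os.path.join(workdir, 'hello.txt')       # hypothetical sample file
with open(src, 'w') as fp:
    fp.write('hello tar')

archive = os.path.join(workdir, 'demo.tgz')
with tarfile.open(archive, 'w:gz') as tar:     # create a gzip-compressed tarball
    tar.add(src, arcname='hello.txt')          # store the file under a relative name

with tarfile.open(archive, 'r:gz') as tar:     # reopen the tarball for reading
    names = tar.getnames()
    member = tar.extractfile('hello.txt')      # read-only file-like object
    content = member.read().decode('utf-8')
print(names, content)

shutil.rmtree(workdir)                         # clean up the scratch directory
```

Passing arcname keeps the absolute temporary path out of the archive.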

os and shutil

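shutil can be read literally as "shell utility". It is a high-level file-operation module whose main strength is copying and deleting files and directory trees. It is simple to use and saves you from calling the lower-level os functions yourself.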

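# shutil module
shutil.copyfile(src, dst)    Copy the contents of the file src to the file dst.
shutil.move(src, dst)        Move or rename a file or directory. On the same file system this is a rename; across file systems the source is copied to dst and then removed.
shutil.copymode(src, dst)    Copy only the permission bits; nothing else is copied.
shutil.copystat(src, dst)    Copy the permission bits, last access time, and last modification time.
shutil.copy(src, dst)        Copy a file to a file or into a directory.
shutil.copy2(src, dst)       Like copy, but also copies the last access and modification times, similar to cp -p.
shutil.copytree(olddir, newdir, True/False)
Copy olddir to a new directory newdir. If the third argument is True, symbolic links in the tree are kept as symbolic links; if it is False, the linked files are copied in their place.
shutil.rmtree(src)    Recursively delete a directory and everything in it.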
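
The shutil calls listed above can be exercised safely inside a temporary directory. This is a minimal sketch in Python 3 syntax; all file and directory names (src, copy, moved, a.txt, b.txt) are made up for the example.

```python
import os
import shutil
import tempfile

base = tempfile.mkdtemp()                      # scratch directory for the demo
src_dir = os.path.join(base, 'src')
os.mkdir(src_dir)
with open(os.path.join(src_dir, 'a.txt'), 'w') as fp:
    fp.write('data')

# copy one file, keeping its timestamps (copy2)
shutil.copy2(os.path.join(src_dir, 'a.txt'), os.path.join(src_dir, 'b.txt'))

# copy the whole tree, then rename the copy with move
copy_dir = os.path.join(base, 'copy')
shutil.copytree(src_dir, copy_dir)
moved_dir = os.path.join(base, 'moved')
shutil.move(copy_dir, moved_dir)

copied_files = sorted(os.listdir(moved_dir))
print(copied_files)

shutil.rmtree(base)                            # delete everything created above
still_exists = os.path.exists(base)
print(still_exists)
```

Because rmtree deletes without confirmation, it is worth pointing it only at paths the script created itself.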

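# os module
os.sep    The path separator used by the operating system: '\\' on Windows, '/' on Linux/Unix.
os.name    A string naming the platform in use: 'nt' on Windows, 'posix' on Linux/Unix.
os.getcwd()    Return the current working directory, i.e. the directory the Python script is working in.
os.getenv()    Get an environment variable; returns None if it is not set.
os.putenv(key, value)    Set an environment variable.
os.listdir(path)    Return the names of all files and directories in the given directory.
os.remove(path)    Delete a file.
os.system(command)    Run a shell command.
os.linesep    The line terminator of the current platform: '\r\n' on Windows, '\n' on Linux and macOS ('\r' was used by classic Mac OS).
os.path.split(path)        Split a path into its directory name and file name.
os.path.isfile()    and os.path.isdir() check whether the given path is a file or a directory, respectively.
os.path.exists()    Check whether the given path actually exists.
os.curdir        The current directory ('.').
os.mkdir(path)    Create a directory.
os.makedirs(path)    Create directories recursively.
os.chdir(dirname)    Change the working directory to dirname.
os.path.getsize(name)    Return the file size in bytes; for a directory the value is platform-dependent.
os.path.abspath(name)    Return the absolute path.
os.path.normpath(path)    Normalize the path string.
os.path.splitext()        Split the file name from its extension.
os.path.join(path,name)    Join a directory with a file or directory name.
os.path.basename(path)    Return the file name.
os.path.dirname(path)    Return the directory part of the path.
os.walk(top,topdown=True,onerror=None)        Iterate over a directory tree.
os.rename(src, dst)        Rename the file or directory src to dst.
os.renames(old, new)    Recursively rename a file or directory. Works like rename(), but creates any intermediate directories needed for the new path.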
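
Several of the os functions above are typically combined when scanning a directory tree. This is a minimal sketch in Python 3 syntax that collects all .txt files under a temporary directory; the file names are made up for the example, and os.path.relpath (not listed above) is used to report paths relative to the top directory.

```python
import os
import shutil
import tempfile

top = tempfile.mkdtemp()                             # scratch directory for the demo
os.makedirs(os.path.join(top, 'sub', 'deep'))        # create nested directories
for rel in ['one.txt',
            os.path.join('sub', 'two.log'),
            os.path.join('sub', 'deep', 'three.txt')]:
    with open(os.path.join(top, rel), 'w') as fp:
        fp.write('x')

txt_files = []
for dirpath, dirnames, filenames in os.walk(top):    # top-down traversal
    for name in filenames:
        root, ext = os.path.splitext(name)           # split name and extension
        if ext == '.txt':
            full = os.path.join(dirpath, name)
            txt_files.append(os.path.relpath(full, top))
txt_files.sort()
print(txt_files)

shutil.rmtree(top)                                   # clean up the scratch directory
```

Building paths with os.path.join instead of string concatenation keeps the code independent of os.sep.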

argparse

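getopt, optparse, and argparse all make it easier to read command-line arguments. getopt is the simplest of the three.

optparse has been deprecated since Python 2.7; argparse is recommended instead.

Example code adapted from "Summary of Python argparse usage" and the formatter_class section of the argparse documentation: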

import argparse,textwrap

# test customized description text (with multiple lines and indentation)
parser = argparse.ArgumentParser(formatter_class=argparse.RawDescriptionHelpFormatter,
    description=textwrap.dedent('''\
         Please do not mess up this text!
         --------------------------------
             I have indented it
             exactly the way
             I want it
         '''))
# test store_true as switch
parser.add_argument("-v", "--verbosity", action='store_true', help="output verbosity")
# test change default type str to type int
parser.add_argument("-l", "--verboseLevel", type=int, help="set verbosity level")
# test choices
parser.add_argument("-r", "--verboseLevelRange", type=int, choices=[0,1,3], help="set verbosity permissioned level")
# test default value
parser.add_argument("-d", "--level", type=int, default=2, help="set verbosity level")
# test exclusive group 
group = parser.add_mutually_exclusive_group()
group.add_argument("-t", "--verbose", action="store_true")
group.add_argument("-q", "--quiet", action="store_true")

args = parser.parse_args()
if args.verbosity:
    print "verbosity turned on"
if args.verboseLevel:
    print "verbosity level: # %d" % (args.verboseLevel)
if args.verboseLevelRange:
    print "verbosity permissioned level: # %d" % (args.verboseLevelRange)
if args.level:
    print "level: # %d" % (args.level)
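
parse_args also accepts an explicit list of arguments, which makes it possible to try a parser without a real command line. This is a minimal sketch in Python 3 syntax that reuses the -d/--level option and the mutually exclusive -t/-q group from above; the description string is made up for the example.

```python
import argparse

parser = argparse.ArgumentParser(description='demo parser')
parser.add_argument('-d', '--level', type=int, default=2, help='set verbosity level')
group = parser.add_mutually_exclusive_group()
group.add_argument('-t', '--verbose', action='store_true')
group.add_argument('-q', '--quiet', action='store_true')

# pass the arguments as a list instead of reading sys.argv
args = parser.parse_args(['-d', '5', '--quiet'])
print(args.level, args.verbose, args.quiet)

# options from the same group cannot be combined:
# argparse prints a usage message to stderr and raises SystemExit
try:
    parser.parse_args(['-t', '-q'])
    conflict_rejected = False
except SystemExit:
    conflict_rejected = True
print(conflict_rejected)
```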

pickle

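pickle is the standard serialization tool in Python. Python 2 also provides cPickle, a reimplementation in C with better performance; in Python 3 the pickle module uses the C implementation automatically when it is available.
Example code adapted from "Common Python libraries: pickle":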

import pickle
# 0: ASCII protocol, compatible with old versions of Python
# 1: binary format, compatible with old versions of Python
# 2: binary format, added in Python 2.3; better support for new-style classes
print 'HIGHEST_PROTOCOL: ', pickle.HIGHEST_PROTOCOL
t = {'name': ['v1', 'v2']}
print t
o = pickle.dumps(t, pickle.HIGHEST_PROTOCOL)    # dump to string
print 'len o: ', len(o)
p = pickle.loads(o)     # load from string
print p

# serialize an in-memory object straight into a file or any file-like object
with open('test.bin', 'wb') as fp:
    pickle.dump(t, fp, pickle.HIGHEST_PROTOCOL)     # dump to binary file
with open('test.bin', 'rb') as fp:
    p = pickle.load(fp)     # load from binary file
    print p

# Pickler/Unpickler
# Pickler(file, protocol).dump(obj) is equivalent to pickle.dump(obj, file[, protocol])
# Unpickler(file).load() is equivalent to pickle.load(file)
# Pickler/Unpickler are better encapsulated and make it easy to swap the file object
f = open('test.bin', 'wb')
pick = pickle.Pickler(f, pickle.HIGHEST_PROTOCOL)
pick.dump(t)
f.close()
f = open('test.bin', 'rb')
unpick = pickle.Unpickler(f)
p = unpick.load()
print p
f.close()
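
To see what the protocol numbers mean in practice, the same object can be serialized with every protocol the interpreter offers and the sizes compared. This is a minimal sketch in Python 3 syntax, where more protocols exist than the three listed above; the record contents are made-up sample data, and no particular sizes are claimed because they depend on the Python version.

```python
import pickle

record = {'name': ['v1', 'v2'], 'count': 3, 'ratio': 0.5}

sizes = {}      # protocol number -> length of the serialized bytes
restored = {}   # protocol number -> object loaded back from those bytes
for protocol in range(pickle.HIGHEST_PROTOCOL + 1):
    blob = pickle.dumps(record, protocol)
    sizes[protocol] = len(blob)
    restored[protocol] = pickle.loads(blob)   # loads detects the protocol itself

for protocol in sorted(sizes):
    print(protocol, sizes[protocol], restored[protocol] == record)
```

pickle.loads needs no protocol argument, so data written with an older protocol stays readable. Only unpickle data from a trusted source, since loading can execute arbitrary code.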

Loading and saving data with pickle in pandas
Example code adapted from pandas.DataFrame.to_pickle:

import pandas as pd
# to_pickle uses pickle.HIGHEST_PROTOCOL by default
original_df = pd.DataFrame({"foo": range(5), "bar": range(5, 10)})
print original_df
original_df.to_pickle("./dummy.pkl")

unpickled_df = pd.read_pickle("./dummy.pkl")
unpickled_df

h5py

See "Using HDF5 in Python"
Install the dependency: sudo pip install h5py

import h5py  # import the package
import numpy as np
# Writing HDF5:
imgData = np.zeros((30,3,128,256))
f = h5py.File('hdf.h5','w')   # create an h5 file; f is the file handle
f['data'] = imgData                 # write the data under the key 'data'
f['labels'] = range(100)            # write the data under the key 'labels'
f.close()                           # close the file

# Reading HDF5:
f = h5py.File('hdf.h5','r')   # open the h5 file
f.keys()                            # list all keys
a = f['data'][:]                    # read all values stored under the key 'data'
f.close()

The contents of the file can be inspected with a command-line tool:

h5dump hdf.h5

Loading and saving data with HDF5 in pandas
Install the dependency: sudo pip install tables
See pandas.DataFrame.to_hdf

import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['a', 'b', 'c'])
df.to_hdf('data.h5', key='df', mode='w')
s = pd.Series([1, 2, 3, 4])
s.to_hdf('data.h5', key='s')

pd.read_hdf('data.h5', 'df')
pd.read_hdf('data.h5', 's')

msgpack

msgpack official website
Install the dependency: sudo pip install msgpack
Example code adapted from "Using the msgpack library in Python":

import msgpack
var = {'a': 'this', 'b': 'is', 'c': 'a test'}
with open('data.txt', 'wb') as f1:
    msgpack.dump(var, f1) # store the data
with open('data.txt', 'rb') as f2:
    # read the data; raw=False decodes strings as UTF-8 (the older encoding='utf-8' argument was removed in msgpack 1.0)
    var = msgpack.load(f2, use_list=False, raw=False)
print(var)

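Loading and saving data with msgpack in pandas
See pandas-msgpack

The pandas documentation states that this feature is still experimental: This is a very new feature of pandas. We intend to provide certain optimizations in the io of the msgpack data. Since this is marked as an EXPERIMENTAL LIBRARY, the storage format may not be stable until a future release. Note that to_msgpack and read_msgpack were later deprecated in pandas 0.25 and removed in pandas 1.0, so the snippet below only runs on older pandas versions.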

import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(5, 2), columns=list('AB'))
df.to_msgpack('foo.msg')
pd.read_msgpack('foo.msg')
