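Friday, January 30, 2009

Using libsvm for SVR

Yesterday's biggest takeaway was realizing that the SVC part of SVM is no use to me, because it needs data that has already been sorted into classes. What I need is SVR, which does regression, that is, curve fitting, and it is said to work quite well~
Today I officially got started. First I wrote a simple data-conversion class (later I'll put all data-format conversions into this class and call them through different functions) that turns the data I used for fuzzy inference into the format libsvm expects. Then, in easy.py, I added "-s 3" to both the cross-validation step and the training step after it, and processed the data with the command from the tutorial, "python easy.py c-z-1.svm". The result was: "Best c=32768.0, g=0.5 CV rate=8.0785"! That doesn't look good. I'm not yet sure what CV rate means here, but c coming out as large as 32768 must mean something has gone wrong~
Searching further online for how to use libsvm, I found another program, gridregression.py, which appears to be a grid search made specifically for SVR. After downloading it I saw that it is a modification of grid.py that seems to just add a -p parameter, so the search now covers the three-dimensional space of c, g, and p. That makes the search even bigger! A heuristic algorithm for finding the best parameters looks really necessary~
I changed the paths in gridregression.py (later I found this isn't actually needed, since easy.py passes the paths in when it calls the script) and changed the corresponding call in easy.py. The resulting c, g, and p were much more reasonable, and the CV rate dropped to a little over 1, so apparently smaller is better~
I've uploaded the modified gridregression.py and easy.py to 发芽网. Next I still need to write a heuristic search for c, g, and p; otherwise the computation is too heavy, with a single run taking ages, and that won't do!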
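One possible shape for such a heuristic is a coarse-to-fine search: evaluate a small grid over the exponents of c, g, and p, then shrink the search box around the best point and repeat. The sketch below is only an illustration of that idea, not what gridregression.py does. The name cv_error is a hypothetical callable that would wrap a cross-validation run and return its error, and the exponent ranges are made-up values.

```python
import itertools

def coarse_to_fine(cv_error, ranges, rounds=3, points=5):
    """Search log2-parameter space by repeatedly shrinking a box.

    cv_error -- hypothetical callable: takes a tuple of exponents and
                returns a cross-validation error (smaller is better)
    ranges   -- list of (low, high) exponent bounds, one per parameter
    """
    best_point, best_err = None, float("inf")
    for _ in range(rounds):
        # Build an evenly spaced axis for every parameter.
        axes = []
        for low, high in ranges:
            step = (high - low) / float(points - 1)
            axes.append([low + i * step for i in range(points)])
        # Evaluate every grid point of this round.
        for point in itertools.product(*axes):
            err = cv_error(point)
            if err < best_err:
                best_point, best_err = point, err
        # Halve each interval and re-centre it on the best point so far.
        ranges = [(centre - (high - low) / 4.0, centre + (high - low) / 4.0)
                  for (low, high), centre in zip(ranges, best_point)]
    return best_point, best_err

# Toy stand-in for a real cross-validation run (assumption, for illustration).
def toy_error(exponents):
    c, g, p = exponents
    return (c - 3) ** 2 + (g + 2) ** 2 + (p + 4) ** 2

best, err = coarse_to_fine(toy_error, [(-5, 15), (-15, 3), (-8, -1)])
print(best, err)
```

With 5 points per axis and 3 rounds this evaluates 3 * 5^3 = 375 points, where one dense grid at the final resolution would need far more. The price is that it can settle into a local minimum if the error surface is bumpy.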

Thursday, January 29, 2009

Stifled Laughter: How the Communist Party Killed Chinese Humor

To a tree, it is a ferocious beast!

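Sunday, January 25, 2009

Learning how to build and install third-party python modules from source

I have installed third-party modules from source before, but those were pure-python sources with no C compilation involved. Anything that needed C compiled, such as libsvm, would complain that I didn't have VS2003 installed and so couldn't be built. Every time that happened I had to go and download a Windows exe installer instead, which often isn't the latest version!
Today I decided to sort this out. A quick search online showed that this command is all it takes: "python setup.py build --compiler=mingw32 install"~
I tried it with libsvm. At first the build failed with an error. Looking closer, the C source files were missing. I copied the source files from the parent directory into the python directory, ran the command again, and it worked.

BTW, how do you remove a third-party module installed this way? The answer is dead simple: go to the python installation directory, find the module's files and folders (if any) that were copied there, and just delete them.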

Friday, January 23, 2009

The Living Should Go On Living Better_Han Han_Sina Blog

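Thursday, January 22, 2009

This month's tasks

Spring Festival is almost here, and this is the easiest time of year to waste~

Better plan early and make full use of this period~

A few tasks:

cops for TS modeling;
svm for TS modeling;
virtual data acquisition;
finish the simulation platform;
finish the display module;

Tasks 1 and 2 continue earlier work, using new algorithms to solve leftover problems. Tasks 3-5 need a lot of programming: the 3D display will use Panda3D, head-motion sensing will use the Wii, and the decision-making of agents inside the simulation platform all take a great deal of work.

Start with Wiki to learn svm, using every bit of online time for that instead of idle browsing. Then do the TS modeling with libsvm+python. The cops method has a matlab program; see whether it can be applied to TS modeling quickly. These two tasks are to be finished before Spring Festival. The programming work goes after Spring Festival.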

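Saturday, January 17, 2009

Socket programming in python

This page (Socket Programming HOWTO) covers socket programming in python; it is very good and very powerful. From it I learned select and how to receive UDP data without blocking. That means I can use a timer instead of a thread, which is much simpler.
I had looked at select a few times before but never understood how to use it. Today I finally understand what it's for~~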
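A minimal sketch of the timer-friendly pattern described here, using only the standard select and socket modules. The helper name poll_udp and the loopback demo are illustrative assumptions rather than code from the HOWTO. The idea is that a timer callback can call poll_udp with a zero timeout and return immediately when nothing has arrived.

```python
import select
import socket

def poll_udp(sock, timeout=0.0, bufsize=4096):
    """Return (data, address) if a datagram is ready within timeout, else None."""
    readable, _, _ = select.select([sock], [], [], timeout)
    if sock in readable:
        return sock.recvfrom(bufsize)
    return None

if __name__ == "__main__":
    receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    receiver.bind(("127.0.0.1", 0))        # port 0: let the OS pick a free port
    print(poll_udp(receiver))              # nothing sent yet, returns at once

    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sender.sendto(b"ping", receiver.getsockname())
    print(poll_udp(receiver, timeout=1.0)) # waits at most one second

    sender.close()
    receiver.close()
```

A timeout of 0 makes select a pure poll, which is what lets a periodic timer replace a dedicated receiving thread.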

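Thursday, January 15, 2009

Converting between latitude/longitude and UTM coordinates

Last night I spent the whole evening getting the conversion from latitude/longitude to UTM coordinates working. I haven't had time for the UTM-to-latitude/longitude direction yet, but with the former done, I have a handle on the latter~
By rights, this LL2UTM conversion has been described on Chinese web pages before. I can't remember the exact URL, but it should be easy to find, along with function code in C that uses the USGS method. I used it to convert latitude/longitude to UTM, the map I drew looked a lot like the real one, and I took it for granted that the job was done.
Yesterday I needed the inverse conversion, UTM2LL, so I searched online and Google led me to this page by Steven Dutch, last updated on April 1, 2008, April Fools' Day~ It explains things quite clearly and covers both LL2UTM and UTM2LL.
There are two methods for LL2UTM, one from the US Army and one from the US Geological Survey, called Army and USGS for short. For UTM2LL, the Army method involves table lookup and interpolation, so only the USGS method is given. Best of all, Steven also offers a downloadable Excel file that implements the conversions! It turned out to be extremely useful, as described below.
At first, without thinking much, I plugged the UTM X and Y coordinates I had converted earlier into the Excel file, but the values it produced were very different from the original latitude and longitude! If the UTM2LL conversion was right, then my LL2UTM had to be wrong! I entered the original latitude/longitude into the Excel file to convert to UTM, and sure enough the result differed from mine, by quite a lot!
Looking closely at the intermediate values in the Excel file, I found that my central meridian was already wrong! The LL2UTM function I had written takes the central meridian as an input parameter, and when I worked out the central meridian for 118 degrees by hand I somehow got 120, when it should be 117. After fixing that and converting again, the results were close. But looking at the remaining error, I still wasn't satisfied. My original function actually uses the USGS method, while Steven recommends the Army method, so I implemented that in python as well. This time it was a bumpy ride!
After writing the function from the formulas Steven gives and calling it with the original latitude/longitude, the UTM values were way off! I checked it carefully several times and found nothing wrong! Luckily the Excel file shows the intermediate values, so I printed all of mine for comparison and found that the error was in the final K* calculations. Looking at the formulas in the Excel file revealed the problem: they differ from the formulas on the web page! I changed my code to match the Excel formulas, tried again, and it worked~~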
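The central-meridian slip described above (117 rather than 120 for 118 degrees) is easy to avoid by computing it from the zone number. A small sketch, assuming the standard 6-degree zones and ignoring the special zones around Norway and Svalbard; the function names are illustrative.

```python
import math

def utm_zone(lon_deg):
    """Return the UTM zone number (1-60) for a longitude in degrees."""
    wrapped = (lon_deg + 180.0) % 360.0          # shift into [0, 360)
    return int(math.floor(wrapped / 6.0)) + 1

def central_meridian(lon_deg):
    """Return the central meridian, in degrees, of the zone containing lon_deg."""
    return utm_zone(lon_deg) * 6 - 183

print(central_meridian(118))   # 117
print(central_meridian(-75))   # -75
```

Zone 50 spans 114 to 120 degrees east, so its central meridian is 117; a longitude of exactly 120 already belongs to zone 51, whose central meridian is 123.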

Converting latitude/longitude to UTM coordinates (USGS method)

import math

def LL2UTM_USGS(a, f, lat, lon, lonOrigin, FN):
    '''
    ** Input: (a, f, lat, lon, lonOrigin, FN)
    ** a          semi-major axis of the ellipsoid
    ** f          flattening of the ellipsoid, f=(a-b)/a, where b is the semi-minor axis
    ** lat        latitude before UTM projection (degrees)
    ** lon        longitude before UTM projection (degrees)
    ** lonOrigin  central meridian (degrees)
    ** FN         false northing: 0 in the northern hemisphere, 10000000.0 m in the southern
    ---------------------------------------------
    ** Output: (UTMEasting, UTMNorthing)
    ** UTMEasting   coordinate in the longitude direction after UTM projection
    ** UTMNorthing  coordinate in the latitude direction after UTM projection
    ---------------------------------------------
    ** Description: UTM projection
    ** Author: Ace Strong
    ** Affiliation: CCA NUAA
    ** Created: July 19, 2008
    ** Version: 1.0
    ** For the formulas implemented here, see
    ** "Coordinate Conversions and Transformations including Formulas" p35.
    ** & http://www.uwgb.edu/dutchs/UsefulData/UTMFormulas.htm
    '''

    # e is the first eccentricity of the ellipsoid; eSquare is e squared
    eSquare = 2*f - f*f
    k0 = 0.9996

    # wrap the longitude into [-180, 180); floor (unlike int) also handles lon < -180
    lonTemp = (lon+180) - math.floor((lon+180)/360.0)*360 - 180
    latRad = math.radians(lat)
    lonRad = math.radians(lonTemp)
    lonOriginRad = math.radians(lonOrigin)
    # second eccentricity squared
    e2Square = (eSquare)/(1-eSquare)

    V = a/math.sqrt(1-eSquare*math.sin(latRad)**2)
    T = math.tan(latRad)**2
    C = e2Square*math.cos(latRad)**2
    A = math.cos(latRad)*(lonRad-lonOriginRad)
    # meridian arc length from the equator
    M = a*((1-eSquare/4-3*eSquare**2/64-5*eSquare**3/256)*latRad
           -(3*eSquare/8+3*eSquare**2/32+45*eSquare**3/1024)*math.sin(2*latRad)
           +(15*eSquare**2/256+45*eSquare**3/1024)*math.sin(4*latRad)
           -(35*eSquare**3/3072)*math.sin(6*latRad))

    # x (easting)
    UTMEasting = k0*V*(A+(1-T+C)*A**3/6
                       + (5-18*T+T**2+72*C-58*e2Square)*A**5/120)+ 500000.0
    # y (northing)
    UTMNorthing = k0*(M+V*math.tan(latRad)*(A**2/2+(5-T+9*C+4*C**2)*A**4/24
                      +(61-58*T+T**2+600*C-330*e2Square)*A**6/720))
    # the false northing is 10000000.0 m in the southern hemisphere
    UTMNorthing += FN
    return (UTMEasting, UTMNorthing)

Converting latitude/longitude to UTM (Army method)

import math

def LL2UTM_Army(a, b, lat_ll, lon_ll, FN):
    '''
    ** Input: (a, b, lat_ll, lon_ll, FN)
    ** a       semi-major axis of the ellipsoid
    ** b       semi-minor axis of the ellipsoid
    ** lat_ll  latitude before UTM projection (degrees)
    ** lon_ll  longitude before UTM projection (degrees)
    ** FN      false northing: 0 in the northern hemisphere, 10000000.0 m in the southern
    ---------------------------------------------
    ** Output: (UTMEasting, UTMNorthing)
    ** UTMEasting   coordinate in the longitude direction after UTM projection
    ** UTMNorthing  coordinate in the latitude direction after UTM projection
    ---------------------------------------------
    ** Description: UTM projection
    ** Author: Ace Strong
    ** Affiliation: CCA NUAA
    ** Created: January 14, 2009
    ** Version: 1.0
    ** For the formulas implemented here, see
    ** http://www.uwgb.edu/dutchs/UsefulData/UTMFormulas.htm
    '''

    lat = math.radians(lat_ll)
    # central meridian of the zone; floor (unlike int) also handles negative longitudes
    lon0_ll = 6*(int(math.floor(lon_ll/6.0))+31)-183
    k0 = 0.9996
    e = math.sqrt(1-b**2/a**2)
    e2 = e**2/(1-e**2)
    n = (a-b)/(a+b)
    rho = a*(1-e**2)/((1-(e*math.sin(lat))**2)**(3.0/2))
    nu = a/((1-(e*math.sin(lat))**2)**(1.0/2))
    # longitude difference in units of 10000 arc seconds (float division on purpose)
    p = (lon_ll-lon0_ll)*3600.0/10000
    sin1 = math.pi/(180*60*60)

    # meridian arc length S as a series in n
    A = a*(1 - n + (5.0/4)*(n**2 - n**3) + (81.0/64)*(n**4 - n**5))
    B = (3*a*n/2)*(1 - n + (7.0/8)*(n**2 - n**3) + (55.0/64)*(n**4 - n**5))
    C = (15*a*n**2/16)*(1 - n + (3.0/4)*(n**2 - n**3))
    D = (35*a*n**3/48)*(1 - n + (11.0/16)*(n**2 - n**3))
    # 315/512 is the coefficient in Helmert's meridian-arc series; the original
    # listing had 51 here (the difference is far below a millimetre)
    E = (315*a*n**4/512)*(1 - n)
    S = A*lat - B*math.sin(2*lat) + C*math.sin(4*lat) - D*math.sin(6*lat) + E*math.sin(8*lat)

    # y (northing)
    K1 = S*k0
    K2 = nu*math.sin(lat)*math.cos(lat)*sin1**2*k0*1e8/2
    K3 = (sin1**4*nu*math.sin(lat)*math.cos(lat)**3/24)*(5 - math.tan(lat)**2 + 9*e2*math.cos(lat)**2 + 4*e2**2*math.cos(lat)**4)*k0*1e16
    UTMNorthing = K1 + K2*p**2 + K3*p**4
    # the false northing is 10000000.0 m in the southern hemisphere
    UTMNorthing += FN
    # x (easting)
    K4 = k0*sin1*nu*math.cos(lat)*1e4
    K5 = (sin1*math.cos(lat))**3*(nu/6)*(1 - math.tan(lat)**2 + e2*math.cos(lat)**2)*k0*1e12
    UTMEasting = K4*p + K5*p**3 + 500000
    return (UTMEasting, UTMNorthing)

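Monday, January 12, 2009

First steps with svm

A few days ago I came across an online video of a talk on machine learning from Google Developer Day. One of the speakers was Professor Chih-Jen Lin (林智仁) from Taiwan, who covered the principles and development of svm. Unfortunately the video didn't show the slides, and the talk mixed Chinese and English, with an English term popping up every so often, so for someone like me without the background there was a lot I didn't follow. The talk mentioned that Professor Lin is a pioneer in svm research (said not by Professor Lin himself, of course, but by another Google engineer) and also mentioned the libsvm toolkit. Following that lead I found libsvm, and it turns out Professor Lin developed it! Its python support is very good and ships inside the package. It also seems to have a lot of users and plenty of material online; from a quick look, it is easy to get started with.
I happened to know that a senior labmate had also used svm, so I sent him a Fetion message asking whether this was what he used. He said it was ls-svm. Another search online showed that this is the least squares support vector machine:

In recent years, Suykens J.A.K. proposed a new kind of support vector machine, the Least Squares Support Vector Machine (LS-SVM), for pattern classification and function estimation problems. LS-SVM uses a least-squares linear system as its loss function, in place of the quadratic programming used by the conventional support vector machine.

LS-SVM reduces the computational complexity. In addition, because LS-SVM uses least squares, it runs noticeably faster than other versions of the support vector machine.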
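For reference, this is the formulation usually given for LS-SVM function estimation. It is a sketch from the general literature, not from the quoted passage, and the symbols (regularization constant $\gamma$, feature map $\varphi$, kernel $K$) are notation assumed here. The inequality constraints of the standard SVM become equalities with a squared-error penalty, so training reduces to solving one linear system instead of a quadratic program.

```latex
% Primal problem: squared errors e_i with equality constraints
\min_{w,\,b,\,e}\; \frac{1}{2}\, w^{\top} w + \frac{\gamma}{2} \sum_{i=1}^{N} e_i^{2}
\quad \text{s.t.} \quad y_i = w^{\top} \varphi(x_i) + b + e_i, \qquad i = 1, \dots, N

% Optimality conditions reduce to a single linear system in (b, \alpha)
\begin{bmatrix} 0 & \mathbf{1}^{\top} \\ \mathbf{1} & \Omega + \gamma^{-1} I \end{bmatrix}
\begin{bmatrix} b \\ \alpha \end{bmatrix}
=
\begin{bmatrix} 0 \\ y \end{bmatrix},
\qquad \Omega_{ij} = K(x_i, x_j)

% Resulting model
f(x) = \sum_{i=1}^{N} \alpha_i\, K(x, x_i) + b
```

Every training point generally ends up with a nonzero $\alpha_i$, so the solution is not sparse the way a standard SVM's is.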

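I asked around, and another junior labmate uses it too. However, ls-svm seems to support only c and matlab, not other languages. Of course anything with a c interface can be wrapped as a module python can use, but that is a bit more trouble. And someone also said:

But this method comes at the cost of generalization ability
Choose the method according to the scope of use

That brought me back to my senses: these are all just tools, and only with a clear understanding of how svm works can I use them sensibly, correctly, and effectively. My senior labmate's thesis happens to have a detailed introduction to SVM, so I'll read that before choosing~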

Saturday, January 10, 2009

piaip's Using (lib)SVM Tutorial

piaip's Using (lib)SVM Tutorial: a quick introduction to (lib)SVM

piaip at csie dot ntu dot edu dot tw,
Hung-Te Lin
Fri Apr 18 15:04:53 CST 2003
$Id: svm_tutorial.html,v 1.13 2007/10/02 05:51:55 piaip Exp piaip $
Original author: Hung-Te Lin (林弘德). Please keep the original source when reposting.

Why this tutorial is here

I've long considered SVM an interesting and useful tool, but I couldn't attend the "Data mining and SVM" course by Prof. cjlin (mostly due to scheduling conflicts). After reading some materials on the internet and hearing kcwu explain how libsvm is used, I wanted to write up some notes as a tutorial for those who do not need to know the complete theory behind SVM in order to use libsvm. The original README and FAQ files that come with libsvm are good documents too, but you may need some basic knowledge of SVM and its workflow to follow them (that's how I felt when I was reading them). This tutorial is specifically for those starting from zero.

Several people later offered feedback, so I must thank those who helped me make this tutorial:

kcwu, biboshen, puffer, somi

Remember that some statements below may not be strictly correct. But for those who just wish to "USE" SVM, I think the explanation below is easier to understand.

This tutorial is basically for people who already know how to program. It's also a memo to myself. Neither much mathematics nor prior SVM knowledge is required.

If you still can't understand this tutorial, there are three possibilities: 1. I didn't explain clearly enough, 2. You lack sufficient common knowledge, 3. You don't use your brain properly ^^;

Since I began writing this with no understanding of the subject myself, and this document has been read by many people who also didn't understand SVM but gained a certain level of understanding after reading it, possibility 1 can be ruled out. Thus if you can't understand it you must belong to the latter two categories :P So even if you have questions after reading this, don't ask me.

SVM: What is it and what can it do for me?

SVM, Support Vector Machine, is something that has similar roots to neural networks. These days it is most often used for classification. That means, if I have some sets of things that have already been classified (but you know nothing about HOW I CLASSIFIED THEM, i.e. you don't know the rules used for classification), then when new data arrives, SVM can PREDICT which set it should belong to.

It sounds marvelous (if it doesn't, think again about what "the classification rules are unknown" means, and if it still doesn't, try writing a program to solve the problem yourself) and would seem to require advanced techniques like AI searching or some time-consuming complex computation. But SVM, which is based on Statistical Learning Theory, solves this problem nicely in reasonable time.

Now let's explain with a graphical example (made with SVMToy). Suppose I mark lots of points with different colors on a plane; the color of each point is its "class" and its location is its data. SVM can then find equations that separate these points, and with these equations we get colored regions. When a new point (data) arrives, we can find (predict) what color (class) it should be just from its location (data):

[Figures: Original Data; SVM Regions]

Of course SVM is not really just about painting and marking regions, but with the example above you should be able to get some idea of what SVM is doing.

To get more familiar with SVM, you may refer to the slides cjlin used in his Data Mining course: pdf or ps.
I'm going to try to explain and use libsvm without those slides.

Thus we can consider SVM as a black box. Just push data into SVM and use the output.

How do I get SVM?

Chih-Jen Lin's (cjlin's) libsvm is of course the best tool you can find.

Download libsvm

Download Location:

libsvm.zip or libsvm.tar.gz

The contents of the .zip and .tar.gz are the same; which to use just depends on your OS. Windows users usually prefer .zip files because they have WinZIP (though I always use WinRAR instead), and UNIX users mostly prefer .tar.gz.

Build libsvm

After extracting the archive, just type make if you are on UNIX. If it doesn't build, read the documentation carefully and apply common sense. Since this is a tutorial I won't spend time on the details, and build failures are really rare; usually it means there is something wrong with your system or you're being dumb. You may ignore the other subdirectories. We only need these three executables to be built: svm-train, svm-scale, and svm-predict.

Windows users may rebuild from source if they want, but there are already prebuilt binaries in the archive: just check the "windows" subdirectory and you should find svmtrain.exe, svmscale.exe, svmpredict.exe, and svmtoy.exe.

Using SVM

libsvm can be used in many ways. This tutorial will only explain the easier parts (mostly classification with the default model).

The programs

I'm going to describe what the most important executables do. (The filenames are a little different under UNIX and Windows; apply common sense to see which one I'm referring to.)

svmtrain: Trains on your data. Running SVM is jokingly called "driving trains" (開火車) among Chinese speakers because of this program's name. svmtrain accepts input in a specific format, explained below, and generates a "Model" file. You may think of a model as SVM's internal data, since predict needs a model to predict and cannot consume the raw data directly. This is quite reasonable when you think about it: training is a time-consuming process, so if we train once and store the internal data in some form, the next prediction only has to load it, which is much faster.
svmpredict: Outputs the predicted class of new input data according to a pre-trained model.
svmscale: Rescales data. The original data may be too large or too small in range, so svmscale can first rescale it to a proper range, making training and prediction faster.

File Format

The file format needs to be explained first. You may also refer to the file "heart_scale" bundled in the official libsvm source archive.

[label] [index1]:[value1] [index2]:[value2] ...
[label] [index1]:[value1] [index2]:[value2] ...
.
.

One record per line, as:

+1 1:0.708 2:1 3:1 4:-0.320 5:-0.105 6:-1

label: Sometimes referred to as 'class', the class (or set) of your classification. Usually we put integers here.
index: Ordered indexes, usually consecutive integers.
value: The data for training, usually lots of real (floating point) numbers.

Each line has the structure described above. It means: I have an array (vector) of data (numbers) value1, value2, .... valueN (with the order of the values specified by the respective index), and the class (or the result) of this array is label.

Maybe this is confusing: why a row of value1, value2, ...? It has to do with how SVM works. You can think of it this way (I'm not saying this is correct): the name is Support "Vector" Machine, so the input training data is a "Vector", that is, a row x1, x2, x3, ... These values are the valueN, and the n in x[n] is given by indexN. They are also called "attributes".

In practice, the input data for the problem you are trying to solve usually involves lots of 'features', or say 'attributes', so the input will be a set (or say vector/array). Take the marking-points-and-finding-regions example described above: each point has coordinates X and Y, so it has two attributes. To describe two points (0,3) and (5,8) as having labels (classes) 1 and 2, we write them as:

1 1:0 2:3
2 1:5 2:8

Likewise, 3-dimensional points have 3 attributes.

This kind of file format has the advantage that we can specify a sparse matrix, i.e. some attributes of a record can be omitted.
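Writing this format from a program takes only a few lines. The sketch below uses an illustrative function name, format_record, which is not part of libsvm. It numbers the attributes from 1 and can optionally leave out zero-valued attributes, on the assumption that an omitted attribute is read back as zero.

```python
def format_record(label, values, skip_zeros=False):
    """Format one record as '[label] [index]:[value] ...' with 1-based indexes."""
    parts = ["%g" % label]
    for index, value in enumerate(values, 1):
        if skip_zeros and value == 0:
            continue                      # sparse form: leave the attribute out
        parts.append("%d:%g" % (index, value))
    return " ".join(parts)

print(format_record(1, [0, 3]))                    # 1 1:0 2:3
print(format_record(2, [5, 8]))                    # 2 1:5 2:8
print(format_record(1, [0, 3], skip_zeros=True))   # 1 2:3
```

The first two calls reproduce the two-point example above; the third shows the sparse form of the first point.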

To Run libsvm

Now I'll show you how to use the libsvm programs. You may use the heart_scale file in the libsvm source archive as input, as I'll do in this example:

You should have a sense by now that using libsvm basically goes like this:

Prepare data in the specified format, and svmscale it if necessary.
Train on the data with svmtrain to create a model.
For new input, use svmpredict to predict the class of the new data.
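The train and predict steps above can be driven from a script. This is a rough sketch that only assumes the command syntax shown in this tutorial and binaries named ./svm-train and ./svm-predict in the current directory. The helper names are made up for illustration, and the scaling step is left out because svm-scale writes to stdout and would need its output redirected to a file.

```python
import subprocess

def workflow_commands(train_file, test_file):
    """Build the train and predict command lines without running them."""
    model_file = train_file + ".model"
    output_file = test_file + ".out"
    return [
        ["./svm-train", train_file, model_file],
        ["./svm-predict", test_file, model_file, output_file],
    ]

def run_workflow(train_file, test_file):
    """Run both steps in order; raises CalledProcessError if one fails."""
    for command in workflow_commands(train_file, test_file):
        subprocess.check_call(command)

for command in workflow_commands("heart_scale", "heart_scale"):
    print(" ".join(command))
```

Keeping the command construction separate from the execution makes it easy to inspect what would be run before calling run_workflow.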

svmtrain

The syntax of svmtrain is basically:

svmtrain [options] training_set_file [model_file]

The format of training_set_file is described above. If model_file is not specified, it will be [training_set_file].model by default. Options can be ignored at first.

The following command will generate the heart_scale.model file. (The screen output is not very important as long as there are no errors.)

./svm-train heart_scale
optimization finished, #iter = 219
nu = 0.431030
obj = -100.877286, rho = 0.424632
nSV = 132, nBSV = 107
Total nSV = 132

svmpredict

The syntax of svmpredict is:

svmpredict test_file model_file output_file

test_file is the data that we are going to 'predict'. Its format is exactly the same as that of training_set_file, the input we fed to svmtrain. After predicting, svmpredict also compares the predicted labels with the labels written in test_file. That means test_file holds the real (correct) classification results, and by comparing them with our predictions we can tell whether the predictions are correct.

So we can use the original training_set_file as test_file and feed it to svmpredict (the file format is the same) to see how high the accuracy is, which helps when tuning the arguments later.

The other arguments should be easy to figure out now: model_file is the model produced by svmtrain, and output_file is where the output is stored.

The output format is simple: one label per line, corresponding to the lines of your test_file.

The following command will generate heart_scale.out:

./svm-predict heart_scale heart_scale.model heart_scale.out
Accuracy = 86.6667% (234/270) (classification)
Mean squared error = 0.533333 (regression)
Squared correlation coefficient = 0.532639 (regression)

As you can see, after we 'predict'ed the original input, the first line gives "Accuracy = 86.6667%" as the accuracy of prediction. If the input carries no labels, the result is a real prediction.
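The accuracy figure can be recomputed by hand from test_file and output_file, which is a handy check when post-processing predictions in your own program. A minimal sketch with illustrative helper names, assuming the label is the first whitespace-separated token on each line of both files:

```python
def read_labels(path):
    """Read the first token of every non-empty line as a float label."""
    labels = []
    with open(path) as handle:
        for line in handle:
            tokens = line.split()
            if tokens:
                labels.append(float(tokens[0]))
    return labels

def accuracy(true_labels, predicted_labels):
    """Return (correct, total, percent) for two equally long label lists."""
    total = len(true_labels)
    correct = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    percent = 100.0 * correct / total if total else 0.0
    return correct, total, percent

correct, total, percent = accuracy([1, 1, -1, -1], [1, -1, -1, -1])
print("Accuracy = %g%% (%d/%d)" % (percent, correct, total))  # Accuracy = 75% (3/4)
```

With the real files this would be called as accuracy(read_labels("heart_scale"), read_labels("heart_scale.out")).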

Now you should basically be able to put SVM to work: just write a program that outputs your data in the correct format, feed the data to SVM for training, then predict and read the output.

Advanced Topics

These topics are a little more advanced and I may not explain them very clearly, because I mainly want to get some ideas across and explain some of the terminology you'll easily run into when reading other (lib)SVM documents.

Scaling

svm-scale is not easy to use right now, but it is necessary, because proper scaling helps both the choice of arguments (described below) and the speed of solving the SVM.
svmscale rescales every attribute to the range specified by -l and -u, usually [0,1] or [-1,1]. The output goes to stdout.
Keep in mind (this is often forgotten) that testing data and training data MUST BE SCALED WITH THE SAME RANGE. Don't forget to scale your testing data before you predict.
The hardest part of using svm-scale is that we can't give it the testing and training data as separate files and scale them together in one command.
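One way to honor the "same range" rule in your own code is to compute each attribute's minimum and maximum from the training data only, then apply those same numbers to the testing data. The sketch below illustrates that idea with made-up function names; it is not how svm-scale is implemented. (The svm-scale usage message in libsvm releases also lists -s and -r options for saving and restoring scaling parameters; check the version you have.)

```python
def fit_ranges(rows):
    """Return a (min, max) pair per attribute, computed from the training rows."""
    return [(min(column), max(column)) for column in zip(*rows)]

def apply_ranges(rows, ranges, lower=-1.0, upper=1.0):
    """Map every attribute linearly so that its training min/max become lower/upper."""
    scaled = []
    for row in rows:
        new_row = []
        for value, (lo, hi) in zip(row, ranges):
            if hi == lo:
                new_row.append(0.0)   # constant attribute: nothing to scale (my choice)
            else:
                new_row.append(lower + (upper - lower) * (value - lo) / float(hi - lo))
        scaled.append(new_row)
    return scaled

train = [[0, 10], [5, 20], [10, 30]]
test = [[5, 40]]
ranges = fit_ranges(train)
print(apply_ranges(train, ranges))   # [[-1.0, -1.0], [0.0, 0.0], [1.0, 1.0]]
print(apply_ranges(test, ranges))    # [[0.0, 2.0]]
```

Testing values outside the training range land outside [lower, upper], as the 2.0 shows. That is expected: the point is that both files go through the same mapping.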

Arguments

As mentioned earlier, we can pass some arguments when training. (Running svm-train without any input file or arguments makes it print the full list of arguments and its syntax help.) These arguments correspond to parameters in the original SVM equations, so they affect the accuracy of prediction.
Let's use c=10 as an example:
./svm-train -c 10 heart_scale
If you predict again now, the accuracy becomes 92.2% (249/270).

Cross Validation

When choosing arguments, people mostly use SVM with this workflow:

Prepare lots of pre-classified (correct) data.
Split them into several training sets randomly.
Train with some arguments and predict the other sets of data to calculate the accuracy.
If the accuracy is not good enough, change the arguments and repeat the train/predict step.

Once we have found some good arguments, we use them to train the model and use that model for the final prediction (on unknown data). This whole process is called cross validation.

While experimenting with the arguments, we can use svmtrain's built-in cross validation support:
-v n: n-fold cross validation
n is the number of sets to split your input data into. Specifying n=3 splits the data into 3 sets: first train the model on sets 1 and 2 and predict set 3 to get the accuracy, then train on sets 2 and 3 and predict set 1, and finally train on sets 1 and 3 and predict set 2. Other values of n work the same way.
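The n-fold split itself is simple to write down. The sketch below only illustrates the bookkeeping (which records are trained on and which are held out in each round). It is not libsvm's implementation, and the function name and the fixed seed are assumptions made for the example.

```python
import random

def k_fold_splits(n_records, n_folds, seed=0):
    """Yield (train_indexes, test_indexes) pairs, one per fold."""
    indexes = list(range(n_records))
    random.Random(seed).shuffle(indexes)            # random assignment to folds
    folds = [indexes[i::n_folds] for i in range(n_folds)]
    for held_out in range(n_folds):
        train = [i for f in range(n_folds) if f != held_out for i in folds[f]]
        yield train, folds[held_out]

for train, test in k_fold_splits(9, 3):
    print(sorted(train), sorted(test))
```

Each record is held out exactly once, so averaging the per-fold accuracies uses every record for testing without ever testing on data the model was trained on.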

If we don't use cross validation, we may easily be fooled by arguments that are only good for some particular input. In the example above, c=10 gave 92.2%. But look at what -v 5 gives:
./svm-train -v 5 -c 10 heart_scale
...
Cross Validation Accuracy = 80.3704%
After averaging the prediction results with cross validation we have only 80.37% accuracy, even worse than with the original argument (86%).

Which arguments matter?

Generally speaking, the two important arguments to adjust when training are gamma (-g) and cost (-c). Cross validation (-v) is usually set to 5.

cost is 1 by default, and gamma defaults to 1/k, where k is the number of attributes (features) in the input data. Then how do we know what values to choose as arguments?

T R Y
Yes, no kidding. Just by trial and error.

When experimenting with arguments, the values are usually increased and decreased exponentially, i.e., as 2^n.

Because we have two arguments, we have to try n*n=n^2 combinations. The values grow in discrete steps, so the process can be thought of as visiting the grid points in a specified region (range) of the X-Y plane (think of graph paper, or of marking every integer intersection point on the plane). Each grid point's X and Y coordinates are converted to exponential values (like 2^x, 2^y), which are then used as the values of cost and gamma for cross validation.

So now you should know what the grid.py in the python subdirectory of libsvm is for: it automates the procedure above, calling svm-train to try all argument values within the region you specify. Python is a programming language, which I won't introduce here (this is a libsvm tutorial, after all). grid.py also plots the results to help you look for good arguments. Many parts of libsvm are tied in with python, such as logging into several hosts automatically and running grids in parallel. Keep in mind that libsvm can be used entirely without python; python just makes things more convenient.

When running a grid (using grid.py is of course the most convenient, but generating the arguments yourself is also OK if you find python hard to deal with), a good range is usually [c,g]=[2^-10,2^10]*[2^-10,2^10]. In fact, [-8,8] for the exponents is usually enough.
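If you do generate the arguments yourself, the grid is just two nested loops over exponents. The sketch below illustrates the idea and is not grid.py's code. The cv_accuracy callable is hypothetical: it stands for whatever runs a cross validation for a given cost and gamma and returns the accuracy. The toy scoring function exists only so the example runs.

```python
def grid_points(c_exponents, g_exponents):
    """Return every (cost, gamma) pair as (2^x, 2^y)."""
    return [(2.0 ** x, 2.0 ** y) for x in c_exponents for y in g_exponents]

def grid_search(cv_accuracy, c_exponents, g_exponents):
    """Try every grid point and return (best_accuracy, cost, gamma)."""
    best = None
    for cost, gamma in grid_points(c_exponents, g_exponents):
        score = cv_accuracy(cost, gamma)
        if best is None or score > best[0]:
            best = (score, cost, gamma)
    return best

# Toy stand-in for a real cross-validation run (assumption, for illustration):
# the score peaks at cost = 2^3 and gamma = 2^-5.
def toy_accuracy(cost, gamma):
    return -(abs(cost - 8.0) + abs(gamma - 0.03125))

exponents = range(-8, 9)                      # [-8, 8] in steps of 1
print(len(grid_points(exponents, exponents))) # 289
print(grid_search(toy_accuracy, exponents, exponents))
```

With 17 exponents per axis the grid has 17 * 17 = 289 points, each costing one full cross validation, which is why coarser steps are often tried first.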

Regression

Another topic worth mentioning is regression.

To explain briefly, so far we have only used SVM for classification, so the labels were always discrete data (i.e. known, fixed values). Regression in this context means predicting labels with continuous (or previously unseen) values. You can think of classification as usually being a prediction with binary outcomes, and regression as a prediction that outputs a real (floating point) number.

For example, suppose I know the stock index is influenced by certain factors and I want to predict the stock market. The index is our label, and those factors, once quantified, become the attributes. Later, when we collect those attributes and give them to SVM, it predicts the index (possibly a number that has never appeared before), and that calls for regression. What about lottery numbers? Since they are always fixed, known numbers, we should obviously predict them with ordinary SVM classification. (Note: this is a real example; llwang once wrote something like this.)

The labels must also be scaled when you use regression, with svm-scale -y lower upper

The bad news is that grid.py does not support regression, and cross validation often does not work well with regression either.

In short, regression is very interesting but also a more advanced use. We won't go into detail here; if you are interested, please refer to other SVM and libsvm documents.
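For regression, accuracy is no longer meaningful, which is why the svm-predict output shown earlier also prints a mean squared error and a squared correlation coefficient. A minimal sketch of both measures follows. The function names are illustrative, and the second one is the ordinary squared Pearson correlation, which I am assuming is what that output line reports.

```python
def mean_squared_error(targets, predictions):
    """Average of the squared differences between targets and predictions."""
    n = float(len(targets))
    return sum((t - p) ** 2 for t, p in zip(targets, predictions)) / n

def squared_correlation(targets, predictions):
    """Squared Pearson correlation between targets and predictions."""
    n = float(len(targets))
    mean_t = sum(targets) / n
    mean_p = sum(predictions) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(targets, predictions))
    var_t = sum((t - mean_t) ** 2 for t in targets)
    var_p = sum((p - mean_p) ** 2 for p in predictions)
    return cov * cov / (var_t * var_p)

targets = [1.0, 2.0, 3.0, 4.0]
predictions = [1.5, 2.5, 3.5, 4.5]
print(mean_squared_error(targets, predictions))   # 0.25
print(squared_correlation(targets, predictions))  # 1.0
```

The example shows why both numbers are useful: predictions that are consistently off by 0.5 correlate perfectly with the targets but still have a nonzero error. As a cross-check against the earlier output, 36 wrong labels out of 270 with labels +1 and -1 give an error of 36 * 4 / 270 = 0.533333.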

Epilogue

Here we have briefly explained how to use libsvm. For more complete usage, please refer to the documents inside the libsvm archive, cjlin's website, and SVM-related documents, or go take cjlin's course if you are a student at National Taiwan University :)

Take a look at libsvmtools, which has many good things for SVM newbies. "SVM for dummies", for example, is a handy tool for observing the libsvm workflow.

Copyright

All HTML/text typed within VIM on Solaris.
Style sheet from W3C Core StyleSheets.

Original URL: http://www.csie.ntu.edu.tw/~r91034/svm/svm_tutorial.html
