蛋塔賣你 (Data Mining)

Aug 15 Sat 2015 12:21
Understanding Big Data in One Go (Business Next 《數位時代》, April 2015)

Understanding Big Data in One Go (Parts 1 and 2)

Author: 李欣宜 | Published: 2015/04/01

http://www.bnext.com.tw/article/view/id/35807

http://www.bnext.com.tw/article/view/id/35809

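Big data is everywhere these days, and everyone from Jack Ma (馬雲) to 釋昭慧 speaks about it with ease. Can you still afford not to understand it? You have probably heard countless big data success stories yet still have only a hazy grasp of the subject. Business Next interviewed leading big data experts and compiled the ten questions you most need answered.

Q: What is big data?

A: Big data (also rendered in Chinese as 巨量資料, "massive data") is essentially the culmination of the data analysis, business intelligence (BI), and statistical applications that enterprises have used widely over the past 10 years. But big data is now more than a data-processing tool; it is a corporate mindset and a business model. Data volumes are growing rapidly, storage costs are falling, software is advancing, and cloud environments have matured. With these conditions in place, data analysis has evolved from understanding the past to predicting the future, and even to breaking with the old and creating unprecedented business models.

Big data is generally defined by three Vs: Volume, Velocity, and Variety. Some add two more, Veracity and Value. Whatever the number of Vs, what most distinguishes big data from traditional data is that it comes from diverse sources in many types, is mostly unstructured, and is updated very quickly, which drives data volume up sharply. To create value from big data, you must also pay attention to the data's veracity.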

Volume、Velocity、Variety + Veracity = Value

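[Figure: Similarities and differences between big data and business analytics]

Q: Why do we need big data?

A: Because everything from people to machines is being deconstructed into data. Data is not only the oil or gold that Obama has called it; it is blood, running through every stage of every person's life. This is neither alarmism nor science fiction, but a reality that is gradually taking shape.

For example, an app called Ovia Fertility analyzed data from 300,000 members to develop an algorithm that pinpoints ovulation and raises the odds of conception; the app has helped 50,000 members become pregnant. Workday has released software that predicts employees' salary growth and when they are likely to leave, helping companies decide the size and timing of each employee's raise and the timing of job transfers. Personal finance is no exception: early this year Tencent (騰訊) launched the first bank to use big data to decide whether to lend. WeBank (微眾銀行) combines facial recognition with public security bureau data to determine a borrower's credit rating.

From having children to work to personal finance, big data will affect every person and every company. For companies, big data promises to improve service quality, increase management efficiency, support decision-making, and create business models. For the general public, big data is another self that may know you better than you know yourself and resolves each unknown for you in advance. When everything is being turned into data, can you do without data?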

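Q: Does big data have to be big?

A: By the narrow definition, big data means data volumes between 100 TB and the petabyte range, but the vast majority of companies do not meet that standard; large companies such as eBay, Amazon, or AT&T may. Volume, however, is only one facet of big data. What big data stands for is the spirit of a "data economy," not mere size.

"Big is the least interesting part of big data," says Stephen Brobst (寶立明), chief technology officer of Teradata (天睿資訊). What companies should really look for, he argues, is non-traditional data that has never been mined, and then extract value from it. That is the right way to think about big data, rather than fixating on size. If you can dig gold out of a seemingly meaningless data mine, who cares whether the mine was as big as a mountain or as small as a doghouse? 翟本喬, founder of 和沛科技, points out that the name "big data" is misleading, because what really matters is big wisdom. Big data is not just about how large the data is: speed and volume can readily be handled with technology, but variety takes more wisdom.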

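Q: Can you use big data without having big data?

A: Yes, you can. Building a big data architecture and environment is indeed expensive, and small and medium-sized enterprises (SMEs) usually cannot commit that kind of money. But the spirit of the big data era lies in making good use of existing or non-traditional data to uncover new business opportunities, so even SMEs and startups can use "big data."

On the technical side, many vendors now offer lower-cost big data processing tools and cloud systems. Some work like apps: you simply pick and pay for the features you need. The industrial data management tool offered by 科智 is one example. At the same time, SMEs often do not need to build a big data system at all. 陳昇瑋, a research fellow at the Institute of Information Science, Academia Sinica (中研院資訊科學研究所), points out that in the vast majority of cases a big data project does not require a Hadoop deployment. This is especially true in Taiwan, where homegrown social media is not well developed and most people use foreign platforms, so the data is not in local hands. Rather than blindly chasing technology and tools, first use a small amount of data to validate a concept, checking whether the data can be turned into a business opportunity, and only then decide whether to build a big data environment.

Viktor Mayer-Schönberger (麥爾荀伯格), an authority on big data, notes in his book Big Data (《大數據》) that large companies enjoy the scale advantage of massive data, while small companies have advantages in cost and innovation. Because small companies are fast and flexible, they can thrive even while staying small.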

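Q: How do I start a big data project?

A: The first step is to set up a department and a title dedicated to coordinating big data projects, at as senior a level as possible. Company leaders must take the power of big data seriously enough to foster a culture that values data across the organization. 蔣居裕, head of Etu, points out that big data is really a management problem, not a technical one; without cross-department collaboration, a big data project is unlikely to get off to a good start.

The second step is to avoid the big data myth. Rather than rushing to monetize data, look first at your company's own problems: define the problem, then try to use data to find a solution. 車品覺, vice president of Alibaba Group, advises companies to put their internal data in order before dreaming about big data. Often the internal data alone is riddled with problems, and data from different departments is incompatible. "Data inside a small or medium-sized company is already fragmented. To say you want to use big data when you have not gotten that right is, frankly, a little hard to understand."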

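Q: Where does big data come from?

A: Anywhere. With the rise of the Internet of Things (IoT), any object or place that could not previously generate data may be "datafied." Teradata CTO Stephen Brobst divides the development of big data into three stages, which illustrate how diverse its sources are: the dot-com era, the social network era, and the IoT era. As early as the internet boom around 2000, people were already studying log data and collecting users' cookies and search behavior. Social networks such as Facebook and Twitter then turned people's interactions into data, and this social data has created substantial commercial value. The third stage, the IoT era, may be the most interesting: both machines and people are being deconstructed into data, which may come from watches, insoles, or even belts. IoT data will be an important target of analysis going forward.

Q: What are the risks of big data?

A: Big data carries the same risks as traditional business analytics; they are not unique to big data. Personal data security has always been an issue, and it only becomes more pressing as data sources multiply and volumes grow. Brian Prentice (布萊恩), research vice president at market research firm Gartner, points out that big data itself has no security problem; the problem lies in how companies use the data. Gartner predicts that by 2018 nearly 50% of corporate business-ethics violations will stem from improper use of big data.

Another concern is the "dictatorship of data" that big data may bring. According to Viktor Mayer-Schönberger, an authority on big data, dictatorship of data means letting data govern us and being blindly bound by analytical results, which leads to abuse or misuse of data. For example, classifying people on the basis of data analysis can label individuals and even stigmatize certain groups. Imagine a future in which we use data to fight crime before it happens: what would that look like?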

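Q: What is the difference between big data and open data?

A: Open data is a kind of big data, but big data is not the same as open data. Open data means raw data originally managed by private organizations or the public sector that is released unconditionally for anyone to use. Public-sector raw data has drawn the most discussion in recent years. Many civic groups argue that public-sector data belongs to the people, so unless personal privacy is involved, the public sector should open its data unconditionally, letting the public connect to it not only to browse but also to build value-added applications.

For startups, open data is an excellent resource; when innovation meets open data, the possibilities are vast. For example, 李慕約, founder of 李慕約有限公司, used the government's open real-time agricultural product price data to design 果菜花終端機, which presents nearly 20 years of data accumulated by the Agriculture and Food Agency (農糧署) in visual charts.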

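Q: Which industries especially need big data solutions?

A: According to a Gartner report, media and communications, banking, and services were the earliest adopters of big data, while insurance, retail, and healthcare are expected to adopt it within two years. But Alibaba vice president 車品覺 points out that in the future every product or service will hold enormous potential for "datafication," so companies need to take data more seriously and pay more attention to collecting and organizing it.

Thomas H. Davenport (湯瑪斯．戴文波特), author of Big Data at Work (《大數據@工作力》), sorts industries into overachievers, the data-disadvantaged, and underachievers according to data volume, data ownership, and how far the data is put to use. Overachievers hold large amounts of data and have already shown excellent analytical results; examples include consumer products, insurers, internet companies, travel, transportation, and credit card companies. Among internet companies, e-commerce firms apply data most directly and intensively. Take Alibaba, the world's largest e-commerce platform. Counterfeit goods have long been rampant there, but Alibaba now fights fakes by analyzing 16 dimensions of data, including product text, image descriptions, rights-holder complaints, and even social media. More than 90% of the products it takes down are now detected proactively by its big data system.

Underachievers are industries that sit on large amounts of data but, because of regulatory limits or rigid thinking, have not yet monetized it, such as media, telecom, banking, and retail. Even so, some have begun to use data. The large retail company Catalina Marketing (卡特琳娜行銷集團), for example, analyzes the purchase records of more than 100 million people, combines them with POS data from its 55,000 retail stores, cross-references customers' purchase histories, and sends coupons tailored to their preferences, which improves marketing efficiency.

The data-disadvantaged have little data on hand, or have enough data but lack a complete structure, and they also tend to lack analytical capability. Many B2B companies, for example, have no contact with end consumers and instead serve downstream vendors, so they inherently lack first-hand data. Notably, Davenport lists medical institutions as data-disadvantaged, but that is because electronic medical records are not widespread in the United States. Taiwan, by contrast, has the world's most complete national health insurance database, so Taiwan's medical institutions should count as underachievers rather than data-disadvantaged.

[Figure: How data analysis has been applied across industries in the past]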

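Q: What are the business models of big data?

A: They fall roughly into three types: (1) monetizing existing data; (2) using data to improve competitiveness; (3) using data as the foundation and core of a service in order to disrupt traditional industries.

In model one, the data itself is the product, or data is used to set marketing strategy and improve products. American Express, for example, lets cardholders link their cards to their Facebook accounts. Once a cardholder becomes a fan of the American Express page, the company offers deals matched to the member's activity on Facebook. Combining social data with member data is meant to give consumers more incentive to get an American Express card.

Model two uses data to improve competitiveness. The results of such projects show up less in revenue than in higher internal efficiency or lower decision-making costs. Many people know that LinkedIn uses data to recommend professional contacts to its users with precision, but few know that LinkedIn has rolled out hundreds of internal data analysis products to help employees work more efficiently. One of them, Voices, is a tool that turns LinkedIn customer-service content into an analysis report within one minute.

Models one and two share the aims of understanding the past, predicting the future, and heading off problems before they occur. One is applied externally and the other internally, and both are common among established companies. Model three consists of companies with data at the core of their business. They were born to disrupt traditional industries and have treated data as their core from the first day of operation. The ride-hailing app Uber and the anti-fraud call app Whoscall are the best examples.

Posted by MR. MINING on PIXNET (痞客邦)

Category: Article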

Aug 15 Sat 2015 09:08
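Notes on Big Data (大數據), Chapter 4: Correlation

Correlation

No longer bound to causality

This is the chapter where I question the authors most.

The authors stress that we have always wanted to know "why," but here "why" is of little help; knowing "what" is enough. They also make clear that correlation cannot truly foretell the future; it only indicates a certain likelihood. Even so, that is already extremely valuable.

It is worth noting that predictive analytics may not explain the cause; it only shows that a problem exists. For instance, it can warn that an engine is overheating but will not tell you whether the cause is a worn fan belt or a loose screw. Yet knowing "what" is already enough.

Most research projects start by setting up a hypothesis, which makes them just as susceptible to preconceived bias and illusion.

"What we choose influences what we find." Danah Boyd, Kate Crawford

Category: Study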

Aug 08 Sat 2015 12:27
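Notes on Big Data (大數據), Chapter 3: Messy

Messy.

In small-data thinking, we naturally want, and need, to reduce data errors. But the authors repeatedly stress a new idea: "more (data)" matters more than "better (quality)"!
The messiness referred to here comes in several forms:

The more data points there are, the higher the chance of errors.
Combining different types of data from different sources, which are not always fully compatible, adds to the disorder.
Data formats are inconsistent.

The idea of big data is to shift the emphasis from "exactness" to "probability." For example, 2 plus 2 equals 3.9, and that is good enough.
Often, society as a whole advances not just because chips are faster or algorithms are better, but because more data is available.
For a long time the most common language for accessing databases has been SQL (Structured Query Language), but in recent years there has been a shift toward NoSQL.
Google's MapReduce and the open-source Hadoop, developed largely at Yahoo, are both used to process large, messy data. According to the book, Hadoop's output is less precise than that of a relational database, so it is not suited to launching a spacecraft or looking up bank account details. The reason is that it skips the usual ETL (extract, transform, load) step and analyzes the data where it sits.

Category: Financial Engineering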

Aug 02 Sun 2015 23:46
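Notes on Big Data (大數據), Chapter 2: More

More.....

Big data

大數據

Big Data: A Revolution That Will Transform How We Live, Work, and Think

Authors: Viktor Mayer-Schönberger (麥爾荀伯格), Kenneth Cukier (庫基耶)
Publication date: 2013/05/30

About the authors

Viktor Mayer-Schönberger (麥爾荀伯格)

　　Professor at the Oxford Internet Institute, University of Oxford,
　　adviser to major companies and organizations such as Microsoft and the World Economic Forum,
　　a recognized authority in the field of big data,
　　and author of eight books and more than a hundred articles.

Kenneth Cukier (庫基耶)

　　Data editor of The Economist and a commentator on big data trends,
　　he frequently publishes articles on business and economics in The New York Times, the Financial Times,
　　and Foreign Affairs.

Chapter 2: More

Doing research with big data is like fishing: at the outset you not only do not know whether you will catch anything, you do not even know what you might catch.

This is an interesting issue. When the "cost of fishing" is high, are people still willing to fish? Either the cost has to come down so that everyone can fish, or skilled fishermen do the fishing and sell their catch to everyone else.
Also, if even "what can be caught" is unknown, will any fisherman be willing to commit?
The idea amounts to: here is a pond, go try your luck! -> This is the dilemma big data faces.

How big is big data?

The "big" in big data is relative, not absolute. It refers to having a "complete" data set. -> This idea is exactly right!
But the authors also offer some examples:

Google processes more than 24 PB of data per day. But how big is 1 PB?
The Sloan Digital Sky Survey (SDSS) began in 2000 and had collected more than 140 TB of information by 2010. But the Large Synoptic Survey Telescope (LSST), scheduled to begin in 2016, will obtain the same amount of data in just 5 working days.
According to Martin Hilbert (希伯特), in 2007 the world stored more than 300 EB of data (1 EB (exabyte) = 1,000 petabytes). But how big is 1 EB?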

1 Byte = 8 bits
1 KB = 1,024 Bytes
1 MB = 1,024 KB = 1,048,576 Bytes
1 GB = 1,024 MB = 1,048,576 KB = 1,073,741,824 Bytes
1 TB = 1,024 GB = 1,048,576 MB = 1,073,741,824 KB = 1,099,511,627,776 Bytes
1 PB = 1,024 TB = 1,048,576 GB =1,125,899,906,842,624 Bytes
1 EB = 1,024 PB = 1,048,576 TB = 1,152,921,504,606,846,976 Bytes
1 ZB = 1,024 EB = 1,180,591,620,717,411,303,424 Bytes
1 YB = 1,024 ZB = 1,208,925,819,614,629,174,706,176 Bytes
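
The table above uses binary multiples (1,024 per step), while the "1 EB = 1,000 petabytes" figure quoted from the book uses decimal multiples (1,000 per step). The following is a minimal Python sketch that reproduces the table and shows both conventions side by side. The names `UNITS`, `bytes_in`, and `to_bytes` are made up for this illustration and do not come from the book.

```python
# Unit names in ascending order; each step multiplies by `base`.
UNITS = ["KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]


def bytes_in(unit: str, base: int = 1024) -> int:
    """Return the number of bytes in one `unit`.

    base=1024 gives the binary convention used in the table above;
    base=1000 gives the decimal (SI) convention.
    """
    exponent = UNITS.index(unit) + 1
    return base ** exponent


def to_bytes(value: int, unit: str, base: int = 1024) -> int:
    """Convert a whole number of `unit`s to bytes."""
    return value * bytes_in(unit, base)


if __name__ == "__main__":
    for unit in UNITS:
        print(f"1 {unit} = {bytes_in(unit):,} bytes (binary) "
              f"or {bytes_in(unit, 1000):,} bytes (decimal)")
    # Google's reported daily volume, under the binary convention.
    print(f"24 PB = {to_bytes(24, 'PB'):,} bytes")
```

The two conventions drift apart as the units grow: at the exabyte level the binary value is roughly 15% larger than the decimal one, so it is worth stating which convention a figure uses.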

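Statisticians have shown that the best way to improve sampling accuracy is not to increase the sample size but to sample randomly. -> Sounds astonishing, doesn't it? Yet this is the theoretical foundation of statistical sampling. But what happens when "sample = population"? In addition, survey methods in statistics usually carry large human-induced errors. With big data, people simply go about their own business while researchers passively collect data on the side, which avoids the biases of past sampling and questionnaire surveys.
In fact, the concept of statistical sampling is less than three centuries old; it is simply a solution that arose in response to a technical constraint at a particular moment in history. -> Does this mean statistics is a transitional product?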
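
The claim that randomness matters more than sample size can be checked with a small simulation. The sketch below is only an illustration under made-up assumptions: a hypothetical population of 100,000 people in which an "easy to reach" half answers "yes" about 50% of the time and the other half about 10% of the time. It compares a small random sample with a much larger sample drawn only from the easy-to-reach half.

```python
import random


def simulate(seed: int = 42):
    """Compare a small random sample with a large biased one.

    Returns (true_rate, small_random_estimate, large_biased_estimate).
    All population parameters here are invented for the illustration.
    """
    rng = random.Random(seed)
    # Easy-to-reach half: about 50% "yes" (1). Hard-to-reach half: about 10%.
    easy = [1 if rng.random() < 0.5 else 0 for _ in range(50_000)]
    hard = [1 if rng.random() < 0.1 else 0 for _ in range(50_000)]
    population = easy + hard
    true_rate = sum(population) / len(population)

    # 1,000 people chosen at random from the whole population.
    small_random = rng.sample(population, 1_000)
    # 20,000 people, but all taken from the easy-to-reach half.
    large_biased = rng.sample(easy, 20_000)

    return (true_rate,
            sum(small_random) / len(small_random),
            sum(large_biased) / len(large_biased))


if __name__ == "__main__":
    true_rate, small_random, large_biased = simulate()
    print(f"true rate:                    {true_rate:.3f}")
    print(f"random sample of 1,000:       {small_random:.3f}")
    print(f"biased sample of 20,000:      {large_biased:.3f}")
```

In this setup the biased sample is twenty times larger yet lands far from the true rate, because adding more rows from the same skewed source does nothing to remove the skew.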

Category: Data Mining

Jul 31 Fri 2015 23:22
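Big data is like teenage sex (大數據就像青少年談性)

Big data is like teenage sex: everyone talks about it, but nobody knows who is actually doing it. Everyone thinks everyone else is doing it, so everyone claims to be doing it.

This joke, posted by Dan Ariely on Facebook, captures the real situation. Few companies are truly doing big data, yet most are eager to try.

Dan Ariely

January 7, 2013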

Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it...

Category: Data Mining

Jul 05 Sun 2015 14:44
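Taiwan's Big Data Industry Chain

SCM Big data

Source: Business Next (數位時代), July 2015

According to a 2015 Business Next article, the pieces Taiwan is missing in the big data industry are data analysis and services.

Networking equipment

明泰
智邦
友訊
正文
台勤

Servers

鴻海
緯創
廣達
英業達
勤誠
和碩

Storage equipment

普安
喬鼎

Systems integration

精誠
聚碩
凌群
零壹
敦陽
資通
華碩
新鼎
台達電
Etu

Data analysis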

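雲深 (http://www.cloudeep.com.tw/)

The website has no content.

關貿 (http://www.yodass.com/)

Location
Taipei headquarters
115台北市南港區三重路19-13號6樓
Tel: (02)2655-1188 (main line)
Fax: (02)3789-5588

Central Taiwan office
407台中市西屯區市政北一路77號7樓之8
Tel: (04)2259-2566
Fax: (04)2259-2688

Kaohsiung service office
801高雄市前金區中正四路211號12樓之1
Tel: (07)215-2066
Fax: (07)215-2088

科智 (http://www.servtech.com.tw/about.php)

Most members of 科智企業股份有限公司 come from a professional technical services team spun off from the Institute for Information Industry (財團法人資訊工業策進會). Its core product is Servolution, an application service solution for key manufacturing-process data. It uses information and communication technology (ICT) to collect plant-wide information and develops analysis techniques for manufacturing optimization. This helps equipment machining plants raise utilization rates and thereby improve supply chain management flexibility, driving innovation in service-oriented manufacturing models.
Its approach integrates signals from front-end sensing devices so that front-end machinery and peripheral equipment can report back through various signal sources. The signals are sent through an integrated, standardized communication protocol to a back-end platform for data analysis and root cause analysis to uncover production bottlenecks, and different application services are then used to strengthen competitive advantage.
Location

台北市大同區環河北路二段115號7樓

威朋 (http://www.vpon.com/zh-tw/)

Founded in 2008, Vpon (威朋) focuses on mobile advertising. Drawing on strong R&D, large-scale data processing and analysis, and business development with brand advertisers, Vpon has served more than 1,000 well-known brands, including McDonald's, Coca-Cola, American Express, and Citibank. It reaches more than 450 million unique users, and its advertising business covers more than 750 cities, including Tokyo, Shanghai, Guangzhou, Hong Kong, and Taipei. Vpon currently has offices in Shanghai, Tokyo, Taipei, and Hong Kong and describes itself as the fastest-growing big data advertising company in Asia. Vpon has won numerous awards; in 2015 Forbes China ranked it third among China's non-listed companies with potential in its top 100 list.
Technical advantages: the first in Asia to offer an innovative mobile advertising service that combines LBS technology with a mobile app advertising model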


Jun 28 Sun 2015 22:49
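Five gambling rules (賭博五律)

While taking the NPDP course, I came across the five gambling rules in the textbook:

When uncertainty is high, keep your bets small.
As risk decreases, raise your bets gradually.
Add to your bets in stages.
At each stage, reduce uncertainty step by step; spend small amounts to buy information.
Set a stop-loss point and exit in time.

Category: Study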

Jun 21 Sun 2015 23:16
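Top 10 data mining algorithms (No. 1: C4.5)

What can C4.5 be used for?

C4.5 is a classification algorithm that expresses its result as a decision tree.

Wait, what is a classifier?

A classifier is a data mining tool that learns how a set of data is divided into classes and then predicts which class new data falls into.

Is there an example?

Sure. Suppose we have data on a group of patients and know each patient's attributes, such as age, blood pressure, pulse, and family history. Given these attributes, we want to predict which patients will get cancer. The patients fall into 2 classes: 1) will get cancer, 2) will not get cancer. We feed the known patient attributes together with their class labels to C4.5 as input, and C4.5 then predicts a new patient's class from that patient's attributes: 1) will get cancer, 2) will not get cancer.

Since the classes of C4.5's input data are known, C4.5 is a supervised learning method.

So how is C4.5 different from other decision tree methods?

C4.5 uses information gain.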

You might be wondering how C4.5 is different than other decision tree systems?

First, C4.5 uses information gain when generating the decision tree.
Second, although other systems also incorporate pruning, C4.5 uses a single-pass pruning process to mitigate over-fitting. Pruning results in many improvements.
Third, C4.5 can work with both continuous and discrete data. My understanding is it does this by specifying ranges or thresholds for continuous data thus turning continuous data into discrete data.
Finally, incomplete data is dealt with in its own ways.
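
The first and third points can be made concrete with a short Python sketch of entropy, information gain, gain ratio, and threshold selection for a continuous attribute. This is a simplified illustration, not C4.5's actual source code: the function names are invented here, the threshold search ranks cuts by plain information gain, and pruning and missing values are left out.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def partition_by(values, labels):
    """Group labels by attribute value: one subset per distinct value."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    return list(groups.values())


def information_gain(labels, subsets):
    """Entropy before the split minus the weighted entropy after it."""
    n = len(labels)
    remainder = sum(len(s) / n * entropy(s) for s in subsets)
    return entropy(labels) - remainder


def gain_ratio(labels, subsets):
    """Information gain divided by the information of the split itself."""
    n = len(labels)
    split_info = -sum(len(s) / n * log2(len(s) / n) for s in subsets if s)
    if split_info == 0:
        return 0.0
    return information_gain(labels, subsets) / split_info


def best_threshold(values, labels):
    """Pick h for the test {A <= h, A > h} on a numeric attribute.

    Sorts the cases by value and tries a cut between each pair of
    successive distinct values. Returns (h, gain), or (None, -1.0)
    if all values are identical.
    """
    pairs = sorted(zip(values, labels))
    best = (None, -1.0)
    for i in range(1, len(pairs)):
        low, high = pairs[i - 1][0], pairs[i][0]
        if low == high:
            continue  # no cut is possible between equal values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        gain = information_gain(labels, [left, right])
        if gain > best[1]:
            best = (low, gain)
    return best


if __name__ == "__main__":
    # Hypothetical toy data in the spirit of the patient example.
    blood_pressure = [110, 120, 130, 140, 150, 160]
    patient_id = [1, 2, 3, 4, 5, 6]
    label = ["no", "no", "no", "yes", "yes", "yes"]

    print(best_threshold(blood_pressure, label))
    by_id = partition_by(patient_id, label)
    print(information_gain(label, by_id), gain_ratio(label, by_id))
```

The `patient_id` case shows why a gain ratio is useful: splitting on a unique identifier yields the maximum information gain even though it says nothing about new patients, and dividing by the split information lowers its score.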

Why use C4.5? Arguably, the best selling point of decision trees is their ease of interpretation and explanation. They are also quite fast, quite popular and the output is human readable.

Where is it used? A popular open-source Java implementation can be found over at OpenTox. Orange, an open-source data visualization and analysis tool for data mining, implements C4.5 in their decision tree classifier.

Classifiers are great, but make sure to check out the next algorithm about clustering…

1 C4.5 and beyond

1.1 Introduction

Systems that construct classifiers are one of the commonly used tools in data mining. Such systems take as input a collection of cases, each belonging to one of a small number of classes and described by its values for a fixed set of attributes, and output a classifier that can accurately predict the class to which a new case belongs. These notes describe C4.5 [64], a descendant of CLS [41] and ID3 [62]. Like CLS and ID3, C4.5 generates classifiers expressed as decision trees, but it can also construct classifiers in more comprehensible ruleset form. We will outline the algorithms employed in C4.5, highlight some changes in its successor See5/C5.0, and conclude with a couple of open research issues.

1.2 Decision trees

Given a set S of cases, C4.5 first grows an initial tree using the divide-and-conquer algorithm as follows:

• If all the cases in S belong to the same class or S is small, the tree is a leaf labeled with the most frequent class in S.
• Otherwise, choose a test based on a single attribute with two or more outcomes. Make this test the root of the tree with one branch for each outcome of the test, partition S into corresponding subsets S1, S2,... according to the outcome for each case, and apply the same procedure recursively to each subset.

There are usually many tests that could be chosen in this last step. C4.5 uses two heuristic criteria to rank possible tests: information gain, which minimizes the total entropy of the subsets {Si} (but is heavily biased towards tests with numerous outcomes), and the default gain ratio that divides information gain by the information provided by the test outcomes. Attributes can be either numeric or nominal and this determines the format of the test outcomes. For a numeric attribute A they are {A ≤ h, A > h} where the threshold h is found by sorting S on the values of A and choosing the split between successive values that maximizes the criterion above. An attribute A with discrete values has by default one outcome for each value, but an option allows the values to be grouped into two or more subsets with one outcome for each subset.

The initial tree is then pruned to avoid overfitting. The pruning algorithm is based on a pessimistic estimate of the error rate associated with a set of N cases, E of which do not belong to the most frequent class. Instead of E/N, C4.5 determines the upper limit of the binomial probability when E events have been observed in N trials, using a user-specified confidence whose default value is 0.25. Pruning is carried out from the leaves to the root. The estimated error at a leaf with N cases and E errors is N times the pessimistic error rate as above. For a subtree, C4.5 adds the estimated errors of the branches and compares this to the estimated error if the subtree is replaced by a leaf; if the latter is no higher than the former, the subtree is pruned. Similarly, C4.5 checks the estimated error if the subtree is replaced by one of its branches and when this appears beneficial the tree is modified accordingly. The pruning process is completed in one pass through the tree.

C4.5’s tree-construction algorithm differs in several respects from CART [9], for instance:

• Tests in CART are always binary, but C4.5 allows two or more outcomes.
• CART uses the Gini diversity index to rank tests, whereas C4.5 uses information-based criteria.
• CART prunes trees using a cost-complexity model whose parameters are estimated by cross-validation; C4.5 uses a single-pass algorithm derived from binomial confidence limits.
• This brief discussion has not mentioned what happens when some of a case’s values are unknown. CART looks for surrogate tests that approximate the outcomes when the tested attribute has an unknown value, but C4.5 apportions the case probabilistically among the outcomes.

1.3 Ruleset classifiers

Complex decision trees can be difficult to understand, for instance because information about one class is usually distributed throughout the tree. C4.5 introduced an alternative formalism consisting of a list of rules of the form “if A and B and C and ... then class X”, where rules for each class are grouped together. A case is classified by finding the first rule whose conditions are satisfied by the case; if no rule is satisfied, the case is assigned to a default class.

C4.5 rulesets are formed from the initial (unpruned) decision tree. Each path from the root of the tree to a leaf becomes a prototype rule whose conditions are the outcomes along the path and whose class is the label of the leaf. This rule is then simplified by determining the effect of discarding each condition in turn. Dropping a condition may increase the number N of cases covered by the rule, and also the number E of cases that do not belong to the class nominated by the rule, and may lower the pessimistic error rate determined as above. A hill-climbing algorithm is used to drop conditions until the lowest pessimistic error rate is found.
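
The way an ordered ruleset classifies a case (the first matching rule wins, otherwise the default class applies) can be sketched in a few lines of Python. The rules, attribute names, and class labels below are invented for illustration and are not produced by C4.5; the sketch covers only how a finished ruleset is applied, not how C4.5 derives or simplifies rules.

```python
from typing import Callable, Dict, List, Tuple

Case = Dict[str, object]
Condition = Callable[[Case], bool]
Rule = Tuple[List[Condition], str]  # (conditions that must all hold, class label)


def classify(case: Case, rules: List[Rule], default_class: str) -> str:
    """Return the class of the first rule whose conditions all hold.

    If no rule is satisfied, the case is assigned to `default_class`.
    """
    for conditions, label in rules:
        if all(condition(case) for condition in conditions):
            return label
    return default_class


# Hypothetical ordered ruleset of the form "if A and B then class X".
RULES: List[Rule] = [
    ([lambda c: c["family_history"], lambda c: c["age"] > 60], "X"),
    ([lambda c: c["blood_pressure"] > 140], "X"),
    ([lambda c: c["age"] <= 40], "Y"),
]

if __name__ == "__main__":
    patient = {"age": 65, "family_history": True, "blood_pressure": 120}
    print(classify(patient, RULES, default_class="Y"))
```

Because the rules are checked in order, their ordering is part of the model: moving a rule up or down can change which class a case receives.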

To complete the process, a subset of simplified rules is selected for each class in turn. These class subsets are ordered to minimize the error on the training cases and a default class is chosen. The final ruleset usually has far fewer rules than the number of leaves on the pruned decision tree.

The principal disadvantage of C4.5’s rulesets is the amount of CPU time and memory that they require. In one experiment, samples ranging from 10,000 to 100,000 cases were drawn from a large dataset. For decision trees, moving from 10 to 100K cases increased CPU time on a PC from 1.4 to 61 s, a factor of 44. The time required for rulesets, however, increased from 32 to 9,715 s, a factor of 300.

1.4 See5/C5.0

C4.5 was superseded in 1997 by a commercial system See5/C5.0 (or C5.0 for short). The changes encompass new capabilities as well as much-improved efficiency, and include:

• A variant of boosting [24], which constructs an ensemble of classifiers that are then voted to give a final classification. Boosting often leads to a dramatic improvement in predictive accuracy.
• New data types (e.g., dates), “not applicable” values, variable misclassification costs, and mechanisms to pre-filter attributes.
• Unordered rulesets: when a case is classified, all applicable rules are found and voted. This improves both the interpretability of rulesets and their predictive accuracy.
• Greatly improved scalability of both decision trees and (particularly) rulesets. Scalability is enhanced by multi-threading; C5.0 can take advantage of computers with multiple CPUs and/or cores.

More details are available from http://rulequest.com/see5-comparison.html.

1.5 Research issues

We have frequently heard colleagues express the view that decision trees are a “solved problem.” We do not agree with this proposition and will close with a couple of open research problems.

Stable trees. It is well known that the error rate of a tree on the cases from which it was constructed (the resubstitution error rate) is much lower than the error rate on unseen cases (the predictive error rate). For example, on a well-known letter recognition dataset with 20,000 cases, the resubstitution error rate for C4.5 is 4%, but the error rate from a leave-one-out (20,000-fold) cross-validation is 11.7%. As this demonstrates, leaving out a single case from 20,000 often affects the tree that is constructed! Suppose now that we could develop a non-trivial tree-construction algorithm that was hardly ever affected by omitting a single case. For such stable trees, the resubstitution error rate should approximate the leave-one-out cross-validated error rate, suggesting that the tree is of the “right” size.

Decomposing complex trees. Ensemble classifiers, whether generated by boosting, bagging, weight randomization, or other techniques, usually offer improved predictive accuracy. Now, given a small number of decision trees, it is possible to generate a single (very complex) tree that is exactly equivalent to voting the original trees, but can we go the other way? That is, can a complex tree be broken down to a small collection of simple trees that, when voted together, give the same result as the complex tree? Such decomposition would be of great help in producing comprehensible decision trees.

C4.5 Acknowledgments

Research on C4.5 was funded for many years by the Australian Research Council. C4.5 is freely available for research and teaching, and source can be downloaded from http://rulequest.com/Personal/c4.5r8.tar.gz.

Category: Data Mining

Jun 21 Sun 2015 22:58
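Top 10 data mining algorithms (overall ranking)

Source: kdnuggets

This series of posts introduces the 10 most influential data mining algorithms, which were selected through surveys conducted on three different platforms. See the survey paper.

This post lists the overall ranking; the individual posts will introduce each algorithm's basic principles and applications. Stay tuned:

top-10-data-mining-algorithms

The top 10 algorithms, in rank order: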

C4.5
k-means
Support vector machines
Apriori
EM
PageRank
AdaBoost
kNN
Naive Bayes
CART

Category: Data Mining

Jul 16 Wed 2014 23:19
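"British research" again: the misuse of statistics

Is it that the British are especially fond of odd research, or that Taiwan's media especially love to "mis"use British research findings to make a big story? Setting aside how reliable the research is, the headlines alone are eye-catching enough that you click through in spite of yourself.

From a statistical standpoint, regardless of how the research question is framed, most of these studies make a common statistical mistake.

Regression analysis is a powerful statistical tool, but if you apply it without understanding what it means, you too can produce results like this "British research": bold hypotheses with no effort at verification.

There are seven common abuses:

Using regression analysis on a nonlinear relationship. (This is like insisting on drawing one line through 3 widely scattered points and hoping that the line represents all 3.)
Correlation between data does not imply causation. (Over a certain period the number of autism cases in the United States rose and China's GDP also rose. That does not mean there is a direct causal relationship between the two.)
Reverse causality. (That A and B are correlated does not tell you whether A causes B or B causes A. The common "British" approach is to pick whichever inference is more sensational.)
Omitted variable bias. (People under heavy life stress are mostly "young people," who have sex more often than older people anyway. The analysis therefore omits the variable "age.")
Highly correlated explanatory variables (multicollinearity). (Just walking 3 times is enough to ......? But people who walk often do not necessarily refrain from other exercise. Putting too many highly correlated variables into one statistical model blurs the focus of the analysis.)
Extrapolating beyond the range of the data. (Using height to estimate intelligence? You could well end up with a ridiculous result such as height = minus 25 cm.)
Data mining (too many variables):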
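
Two of the abuses above, fitting a straight line to a nonlinear relationship and extrapolating beyond the data, can be reproduced with a few lines of Python. The sketch below fits ordinary least squares by hand to made-up data that follows y = x² for x from 1 to 5; the function names and the data are assumptions chosen only for this illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x, slope, intercept):
    """Evaluate the fitted line at x."""
    return slope * x + intercept


if __name__ == "__main__":
    # Made-up nonlinear data: y = x squared, observed only for x in 1..5.
    xs = [1, 2, 3, 4, 5]
    ys = [x * x for x in xs]
    slope, intercept = fit_line(xs, ys)
    print(f"fitted line: y = {slope:.1f} * x + {intercept:.1f}")

    for x in (0, 3, 10):
        print(f"x = {x:>2}: line predicts {predict(x, slope, intercept):.1f}, "
              f"true value is {x * x}")
```

Inside the observed range the line is a tolerable approximation, but at x = 0 it predicts a negative value for a quantity that can never be negative, and at x = 10 it falls far short of the true value. This is the same kind of absurdity as the negative height in the example above.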
