
Current category: Data Mining (16)


To start, here are some recommendations and preferences people have shared on choosing data analysis tools, for your reference.

In the end, packaged software such as SAS and SPSS is usually ruled out for being too expensive, and among open-source data analysis tools the two most commonly used are R and Python, so the question of whether to choose R or Python comes up constantly.

The biggest difference between R and Python is that Python is a general-purpose programming language, whereas R exists specifically for the data/statistical analysis discipline. This gives Python a major advantage: besides data analysis, it can be applied to many other kinds of work. In so-called "big data" environments, Python also offers many packages that require no extra coding, making big-data development more convenient. But if you are going to develop directly on a real big-data platform such as Spark, it is better to use its native language, Scala.

Bottom line: programming languages never stop evolving, and you can never learn them all. If at this point you are familiar with neither Python nor R, start with Python!


MR. MINING 發表在 痞客邦 留言(0) 人氣()

Original article: http://blog.import.io/post/20-questions-to-detect-...

Data scientist is officially the sexiest job of the 21st century, and everyone wants a piece of the pie, including people who call themselves data scientists but do not actually have the right skills. Many people may think of themselves as data scientists purely because they work with data.

(photo: Kirk Borne)

"Fake data scientists are often experts in one particular field who insist that their field is the only true data science. That belief misses the point: data science means applying a whole arsenal of scientific tools and techniques (math, computing, visualization, analytics, statistics, experimentation, problem definition, model building and validation, and so on) to obtain discoveries, insight, and value from data collections."

Kirk Borne, Principal Data Scientist at Booz Allen Hamilton and founder of RocketDataScience.org

To help you sort a real data scientist from a fake (or misguided) one, we have put together 20 interview questions:


  1. Explain what regularization is and why it is useful.
  2. Which data scientists do you admire most? Which startups?
  3. How would you validate a model you created to generate a predictive model of a quantitative outcome variable using multiple regression?
  4. Explain what precision and recall are. How do they relate to the ROC curve?
  5. How can you prove that one improvement you've brought to an algorithm is really an improvement over not doing anything?
  6. What is root cause analysis?
  7. Are you familiar with pricing optimization, price elasticity, inventory management, competitive intelligence? Give examples.
  8. What is statistical power?
  9. Explain what resampling methods are and why they are useful. Also explain their limitations.
  10. Is it better to have too many false positives, or too many false negatives? Explain.
  11. What is selection bias, why is it important and how can you avoid it?
  12. Give an example of how you would use experimental design to answer a question about user behavior.
  13. What is the difference between "long" and "wide" format data?
  14. What method do you use to determine whether the statistics published in an article (e.g. in a newspaper) are wrong, or are presented to support the author's point of view rather than correct, comprehensive factual information on a specific subject?
  15. Explain Edward Tufte's concept of "chart junk."
  16. How would you screen for outliers and what should you do if you find one?
  17. How would you use extreme value theory, Monte Carlo simulations or mathematical statistics (or anything else) to correctly estimate the chance of a very rare event?
  18. What is a recommendation engine? How does it work?
  19. Explain what a false positive and a false negative are. Why is it important to differentiate these from each other?
  20. Which tools do you use for visualization? What do you think of Tableau? R? SAS? (for graphs.) How do you efficiently represent 5 dimensions in a chart (or in a video)?


How do you quantify a real data scientist?




Big Data

Big Data: A Revolution That Will Transform How We Live, Work, and Think

 

About the authors

Viktor Mayer-Schonberger

  Professor at the Oxford Internet Institute,
  adviser to companies and organizations such as Microsoft and the World Economic Forum,
  a recognized authority in the field of big data,
  and author of eight books and over a hundred articles.

Kenneth Cukier

  Data editor of The Economist and a commentator on the big-data movement,
  he regularly publishes articles on finance and economics
  in The New York Times, the Financial Times, and Foreign Affairs.

Chapter 2: More Data (More)

  • Doing research with big data is like fishing: at the start, you not only don't know whether you will catch anything, you don't even know what you might catch.
    • This is an interesting issue. When the "cost of fishing" is high, will people still be willing to fish? Either the cost must be lowered so that everyone can fish, or fishermen who know how to fish must catch fish and sell them to everyone else.
    • Moreover, if even "what can be caught" is unknown, will any fisherman be willing to invest in fishing?
    • The idea amounts to: here is a pond, go try your luck!! -> This is the predicament big data faces.
  • Just how big is big data?
    • The "big" in big data is relative, not absolute; what matters is having a "complete" data set. -> This idea is exactly right!!
    • But the authors also offer some examples:
    1. Google processes more than 24 PB of data per day. But how big is 1 PB, really?
    2. The Sloan Digital Sky Survey (SDSS) began in 2000 and had collected more than 140 TB of information by 2010. The Large Synoptic Survey Telescope (LSST), due to begin in 2016, will gather the same amount of data in just five working days.
    3. By Hilbert's calculation, in 2007 the world had stored more than 300 EB of data (1 EB (exabyte) = 1,000 PB). But how big is 1 EB, really?
    • 1 Byte = 8 Bits
    • 1 KB = 1,024 Bytes  
    • 1 MB = 1,024 KB = 1,048,576 Bytes  
    • 1 GB = 1,024 MB = 1,048,576 KB = 1,073,741,824 Bytes 
    • 1 TB = 1,024 GB = 1,048,576 MB = 1,073,741,824 KB = 1,099,511,627,776 Bytes 
    • 1 PB = 1,024 TB = 1,048,576 GB =1,125,899,906,842,624 Bytes 
    • 1 EB = 1,024 PB = 1,048,576 TB = 1,152,921,504,606,846,976 Bytes 
    • 1 ZB = 1,024 EB = 1,180,591,620,717,411,303,424 Bytes 
    • 1 YB = 1,024 ZB = 1,208,925,819,614,629,174,706,176 Bytes 
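As a quick sanity check, the unit table above can be reproduced in a few lines (a sketch; binary, 1,024-based prefixes exactly as in the table):

```python
# Check the binary storage-unit table above (1 KB = 1,024 bytes).
UNITS = ["KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def bytes_in(unit):
    """Number of bytes in one <unit>, using binary (1,024-based) prefixes."""
    return 1024 ** (UNITS.index(unit) + 1)

print(bytes_in("PB"))  # 1125899906842624, matching the 1 PB row
print(bytes_in("EB"))  # 1152921504606846976, matching the 1 EB row
```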
  • Statisticians have shown that the best way to improve sampling accuracy is not to increase the sample size but to sample randomly. -> That sounds astonishing, right!? But it is the theoretical basis of statistical sample analysis. Then again, what happens when "sample = population"? Also, statistical survey methods usually carry large human error, whereas big data lets people simply do what they would naturally do while researchers passively collect the data from the sidelines, avoiding the biases of traditional sampling and questionnaires.
  • In fact, the concept of statistical sampling is less than three centuries old; it emerged as a solution to a technical limitation at a particular moment in history. -> Does that mean statistics is a transitional product?


Big data is like teenage sex: everyone talks about it, but nobody knows who is actually doing it. Everyone thinks everyone else is doing it, so everyone claims to be doing it.

This is a joke Dan Ariely posted on Facebook, but it captures the real situation: few companies are actually doing big data, yet everyone is eager to try.

 

Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it...

 

 

 


What can C4.5 be used for?

C4.5 is a classification algorithm whose output is presented as a decision tree.

 

Wait a moment: what is a classifier?

A classifier is a data mining analysis tool that partitions a set of data into classes and can then predict which class a new record falls into.

 

Any examples?

Sure. Suppose we have data on a group of patients, and we know each patient's attributes, such as age, blood pressure, pulse, and family history. Given these attributes, we want to predict which patients will develop cancer, so the patients fall into two classes: 1) will develop cancer, and 2) will not develop cancer. We feed the known patient attributes together with their class labels into C4.5 as training data, and C4.5 then predicts a new patient's class, 1) will develop cancer or 2) will not, from that patient's attributes.

 

 

Since the classes of C4.5's input data are known, C4.5 is of course a supervised learning method.

 

You might be wondering how C4.5 is different from other decision tree systems?

  • First, C4.5 uses information gain when generating the decision tree.
  • Second, although other systems also incorporate pruning, C4.5 uses a single-pass pruning process to mitigate over-fitting. Pruning results in many improvements.
  • Third, C4.5 can work with both continuous and discrete data. My understanding is it does this by specifying ranges or thresholds for continuous data thus turning continuous data into discrete data.
  • Finally, C4.5 has its own ways of dealing with incomplete data.
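As an illustration of the first point, here is a minimal sketch of the information gain and gain ratio calculations (the toy weather attribute and labels are made up for demonstration; this is not Quinlan's implementation):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr_index):
    """Reduction in entropy obtained by splitting on the attribute at attr_index."""
    remainder = 0.0
    for value in set(r[attr_index] for r in rows):
        subset = [l for r, l in zip(rows, labels) if r[attr_index] == value]
        remainder += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - remainder

def gain_ratio(rows, labels, attr_index):
    """C4.5's default criterion: information gain divided by the split information."""
    split_info = entropy([r[attr_index] for r in rows])
    return info_gain(rows, labels, attr_index) / split_info if split_info else 0.0

# Toy data: one binary attribute that splits the two classes perfectly.
rows = [("sunny",), ("sunny",), ("rainy",), ("rainy",)]
labels = ["yes", "yes", "no", "no"]
print(info_gain(rows, labels, 0))   # 1.0 bit: the split removes all uncertainty
print(gain_ratio(rows, labels, 0))  # 1.0: the split information is also 1 bit
```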


Why use C4.5? Arguably, the best selling point of decision trees is their ease of interpretation and explanation. They are also quite fast, quite popular and the output is human readable.

Where is it used? A popular open-source Java implementation can be found over at OpenTox. Orange, an open-source data visualization and analysis tool for data mining, implements C4.5 in their decision tree classifier.

Classifiers are great, but make sure to check out the next algorithm about clustering…

 

 

 

1 C4.5 and beyond

1.1 Introduction

Systems that construct classifiers are one of the commonly used tools in data mining. Such systems take as input a collection of cases, each belonging to one of a small number of classes and described by its values for a fixed set of attributes, and output a classifier that can accurately predict the class to which a new case belongs. These notes describe C4.5 [64], a descendant of CLS [41] and ID3 [62]. Like CLS and ID3, C4.5 generates classifiers expressed as decision trees, but it can also construct classifiers in more comprehensible ruleset form. We will outline the algorithms employed in C4.5, highlight some changes in its successor See5/C5.0, and conclude with a couple of open research issues.

1.2 Decision trees

Given a set S of cases, C4.5 first grows an initial tree using the divide-and-conquer algorithm as follows:

  • If all the cases in S belong to the same class or S is small, the tree is a leaf labeled with the most frequent class in S.
  • Otherwise, choose a test based on a single attribute with two or more outcomes. Make this test the root of the tree with one branch for each outcome of the test, partition S into corresponding subsets S1, S2, ... according to the outcome for each case, and apply the same procedure recursively to each subset.

 

There are usually many tests that could be chosen in this last step. C4.5 uses two heuristic criteria to rank possible tests: information gain, which minimizes the total entropy of the subsets {Si} (but is heavily biased towards tests with numerous outcomes), and the default gain ratio that divides information gain by the information provided by the test outcomes.

Attributes can be either numeric or nominal and this determines the format of the test outcomes. For a numeric attribute A they are {A ≤ h, A > h} where the threshold h is found by sorting S on the values of A and choosing the split between successive values that maximizes the criterion above. An attribute A with discrete values has by default one outcome for each value, but an option allows the values to be grouped into two or more subsets with one outcome for each subset.

The initial tree is then pruned to avoid overfitting. The pruning algorithm is based on a pessimistic estimate of the error rate associated with a set of N cases, E of which do not belong to the most frequent class. Instead of E/N, C4.5 determines the upper limit of the binomial probability when E events have been observed in N trials, using a user-specified confidence whose default value is 0.25.

Pruning is carried out from the leaves to the root. The estimated error at a leaf with N cases and E errors is N times the pessimistic error rate as above. For a subtree, C4.5 adds the estimated errors of the branches and compares this to the estimated error if the subtree is replaced by a leaf; if the latter is no higher than the former, the subtree is pruned. Similarly, C4.5 checks the estimated error if the subtree is replaced by one of its branches and when this appears beneficial the tree is modified accordingly. The pruning process is completed in one pass through the tree.

C4.5's tree-construction algorithm differs in several respects from CART [9], for instance:

  • Tests in CART are always binary, but C4.5 allows two or more outcomes.
  • CART uses the Gini diversity index to rank tests, whereas C4.5 uses information-based criteria.
  • CART prunes trees using a cost-complexity model whose parameters are estimated by cross-validation; C4.5 uses a single-pass algorithm derived from binomial confidence limits.
  • This brief discussion has not mentioned what happens when some of a case's values are unknown. CART looks for surrogate tests that approximate the outcomes when the tested attribute has an unknown value, but C4.5 apportions the case probabilistically among the outcomes.

1.3 Ruleset classifiers

Complex decision trees can be difficult to understand, for instance because information about one class is usually distributed throughout the tree. C4.5 introduced an alternative formalism consisting of a list of rules of the form "if A and B and C and ... then class X", where rules for each class are grouped together. A case is classified by finding the first rule whose conditions are satisfied by the case; if no rule is satisfied, the case is assigned to a default class.

C4.5 rulesets are formed from the initial (unpruned) decision tree. Each path from the root of the tree to a leaf becomes a prototype rule whose conditions are the outcomes along the path and whose class is the label of the leaf. This rule is then simplified by determining the effect of discarding each condition in turn. Dropping a condition may increase the number N of cases covered by the rule, and also the number E of cases that do not belong to the class nominated by the rule, and may lower the pessimistic error rate determined as above. A hill-climbing algorithm is used to drop conditions until the lowest pessimistic error rate is found.

 

To complete the process, a subset of simplified rules is selected for each class in turn. These class subsets are ordered to minimize the error on the training cases and a default class is chosen. The final ruleset usually has far fewer rules than the number of leaves on the pruned decision tree.

The principal disadvantage of C4.5's rulesets is the amount of CPU time and memory that they require. In one experiment, samples ranging from 10,000 to 100,000 cases were drawn from a large dataset. For decision trees, moving from 10 to 100K cases increased CPU time on a PC from 1.4 to 61 s, a factor of 44. The time required for rulesets, however, increased from 32 to 9,715 s, a factor of 300.

1.4 See5/C5.0

C4.5 was superseded in 1997 by a commercial system See5/C5.0 (or C5.0 for short). The changes encompass new capabilities as well as much-improved efficiency, and include:

  • A variant of boosting [24], which constructs an ensemble of classifiers that are then voted to give a final classification. Boosting often leads to a dramatic improvement in predictive accuracy.
  • New data types (e.g., dates), "not applicable" values, variable misclassification costs, and mechanisms to pre-filter attributes.
  • Unordered rulesets—when a case is classified, all applicable rules are found and voted. This improves both the interpretability of rulesets and their predictive accuracy.
  • Greatly improved scalability of both decision trees and (particularly) rulesets. Scalability is enhanced by multi-threading; C5.0 can take advantage of computers with multiple CPUs and/or cores.

More details are available from http://rulequest.com/see5-comparison.html.

1.5 Research issues

We have frequently heard colleagues express the view that decision trees are a "solved problem." We do not agree with this proposition and will close with a couple of open research problems.

Stable trees. It is well known that the error rate of a tree on the cases from which it was constructed (the resubstitution error rate) is much lower than the error rate on unseen cases (the predictive error rate). For example, on a well-known letter recognition dataset with 20,000 cases, the resubstitution error rate for C4.5 is 4%, but the error rate from a leave-one-out (20,000-fold) cross-validation is 11.7%. As this demonstrates, leaving out a single case from 20,000 often affects the tree that is constructed! Suppose now that we could develop a non-trivial tree-construction algorithm that was hardly ever affected by omitting a single case. For such stable trees, the resubstitution error rate should approximate the leave-one-out cross-validated error rate, suggesting that the tree is of the "right" size.

Decomposing complex trees. Ensemble classifiers, whether generated by boosting, bagging, weight randomization, or other techniques, usually offer improved predictive accuracy. Now, given a small number of decision trees, it is possible to generate a single (very complex) tree that is exactly equivalent to voting the original trees, but can we go the other way? That is, can a complex tree be broken down to a small collection of simple trees that, when voted together, give the same result as the complex tree? Such decomposition would be of great help in producing comprehensible decision trees.

 

Acknowledgments

Research on C4.5 was funded for many years by the Australian Research Council. C4.5 is freely available for research and teaching, and source can be downloaded from http://rulequest.com/Personal/c4.5r8.tar.gz.


Source: kdnuggets

This series of articles will introduce the 10 most influential data mining algorithms, as determined by surveys run on three different polling platforms (see the survey paper).

This post lists the overall ranking; each later article will introduce an algorithm's basic principles and applications. Stay tuned:

 


The top 10 algorithms, in order, are:

  1. C4.5
  2. k-means
  3. Support vector machines
  4. Apriori
  5. EM
  6. PageRank
  7. AdaBoost
  8. kNN
  9. Naive Bayes
  10. CART


XCS (eXtended Classifier System) was designed to improve on weaknesses of the LCS (Learning Classifier System) mechanism. The goal is to let each rule's competitive standing truly reflect its strength, and to remove unfair behavior from the competition between rules, for example:

  • Unfair competition: should every rule be allowed to stake only 10% of its points? If so, a rule that (wins for 2 years, then loses for 2) and a rule that (loses for 2 years, then wins for 2) end up scored unfairly.
  • Unfair rewards: should a rule that guesses correctly several times in a row receive a larger reward?
  • Unfair mating: should "bullish" rules only be allowed to mate with other "bullish" rules?

Improvements:

XCS uses accuracy (p), error rate (e), and fitness (F) to drive the competition, rather than a single indicator.


Application: LCS (Learning Classifier System) is a machine learning system whose main aim is to improve the quality of problem solutions in a dynamic environment.

Principle: a population of rules competes, and each rule is scored according to the results it produces in a given environment. A genetic algorithm then takes the higher-scoring rules and applies crossover, copying, and mutation to create new rules, which join the competition alongside the existing ones. In this way the better rules win out.

Example: suppose we start with 50 rules, each competing from an initial 100 points; a correct guess earns +1% of the rule's points, and a wrong guess costs -1%.

On day 1, 15 rules match the day's decision environment; those that guess right gain points (100 + 100×1% = 101) and those that guess wrong lose points (100 - 100×1% = 99).

On day 2, suppose 10 rules match the environment; 3 of them guessed right on day 1, 6 guessed wrong on day 1, and 1 did not answer on day 1. Then:

  Right on day 1, right on day 2: 101 + 101×1%

  Right on day 1, wrong on day 2: 101 - 101×1%

  Wrong on day 1, right on day 2: 99 + 99×1%

  Wrong on day 1, wrong on day 2: 99 - 99×1%

  No answer on day 1, right on day 2: 100 + 100×1%

  No answer on day 1, wrong on day 2: 100 - 100×1%

  The contest runs this way for 100 days, after which the top 20 rules undergo crossover, copying, and mutation; the new rules then compete again with the original ones.
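The ±1% scoring rule in the example above can be sketched in a few lines (a simplified illustration of the point updates only, not a full LCS):

```python
def update(points, correct):
    """Apply the post's scoring rule: +1% of current points if correct, -1% if wrong."""
    return points * (1.01 if correct else 0.99)

score = 100.0
score = update(score, True)   # day 1 correct -> 101.0
score = update(score, True)   # day 2 correct -> 102.01
print(score)

print(update(update(100.0, False), True))  # wrong then correct -> 99 + 0.99 = 99.99
```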

System architecture diagram:

(figure: LCS system architecture)

 

 

 


Outlier detection can be applied in many domains, but it often runs into the following challenges:

  1. It is hard to model normal/abnormal objects effectively: in general the boundary between normal and abnormal is far from clear-cut, and there is a large gray area. Some outlier detection methods therefore do not label objects "normal" or "abnormal" at all, but instead assign each object an outlier-ness score.
  2. Domain-specific outlier detection: technically, the choice of similarity and distance measures is critical in outlier detection. Unfortunately, that choice is highly application-dependent, and different domains have different requirements. In medical data analysis, for example, a very small deviation may be enough to count as abnormal; in market analysis, by contrast, the objects usually vary a great deal, so a much larger deviation may be needed to define an anomaly. As a result, it is hard to develop outlier detection into a one-size-fits-all method.
  3. Handling noise in outlier detection: as mentioned in the previous post, outliers and noise are different things. It is also widely recognized that the quality of the data we analyze is generally poor; noise can appear as variance or missing values and hide the true outliers.
  4. Understandability: in some applications users want not only to find the outliers but also to understand why those objects are outliers. To meet this need, an outlier detection method must be designed to justify its findings.


In general, an outlier is a data object that deviates significantly from the other data objects and is generated by a different mechanism. Although outliers are sometimes called abnormal data, not every outlier should be discarded: in credit card fraud detection, for example, a customer's abnormal behavior is exactly what the analysis is looking for.

Noise, by contrast, consists of randomly generated errors or variance; in data mining, noise is usually removed before outlier detection is performed.

Outliers fall roughly into three categories.

Global outlier

Also called "point anomalies": data that deviates significantly from the rest of the data set. Most outlier detection work aims at finding this kind of outlier.

To detect these anomalies effectively, the key issue is to find a deviation measure suited to the subject being analyzed: the data is transformed in various ways, and an outlier detection method is then applied to classify the results.
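As one concrete (and hypothetical) example of such a deviation measure, a z-score flags points that lie far from the mean; the data set and the 2.5 threshold below are made up for illustration:

```python
import statistics

def global_outliers(values, threshold=2.5):
    """Flag values whose z-score (distance from the mean, in standard
    deviations) exceeds the threshold -- one simple deviation measure."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    return [v for v in values if abs(v - mu) / sigma > threshold]

data = [10, 11, 9, 10, 12, 10, 11, 9, 10, 95]
print(global_outliers(data))  # [95]
```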

Contextual outlier

Also called a "conditional outlier". Is "today's temperature is 38 degrees" an outlier? It is hard to say for certain, because it depends on the "time" and "place". A contextual outlier must therefore be analyzed against the specific context of the data:

  • Contextual attributes
  • Behavioral attributes

Collective outlier

Is "a material shortage 10 times a month" an outlier? Perhaps not; but if all 10 shortages fall on the same day, that is clearly an anomaly. No individual record is abnormal, yet the records may be abnormal when they occur together in a particular subset. So in outlier detection we also need to understand the background data in order to analyze effectively.




Our company's cross-national project (Be-Warned) is currently offering an internship opportunity in Taiwan; please take a look if you are interested. This is a very challenging large-scale project that puts data mining and predictive modeling concepts to work in a real application.

See: 104 ASML Data Analyst (4-month internship)

Job description:

Sector Information:
Be-Warned project plans to deliver the predictive model solution in July, and we currently have 110 models collected from WW offices, but their quality is unknown. In order to ensure the models we deliver all have good prediction quality, in-house model validation and fine-tuning is necessary. However, because testing a model costs a lot of time, we propose to hire interns to speed up the progress and ensure we can deliver the deliverable on time.

Job Description:
    • Based on the predictive model building guideline to verify whether the model information is sufficient or not. 
    • Test the model performance based on model definition. 
    • Generate model quality report with the pre-defined template. 
    • Provide model fine-tuning suggestions.
Education: bachelor's degree or above
Major: statistics-related fields, general business
Language: English -- listening/speaking/reading/writing: fluent
Other requirements:

• Must be strong in discipline and willing to follow the guideline 
• Attention to the details 
• Basic statistical skill, data-mining skill is plus. 
• Good at Office excel, VBA is plus 
• Good communication skill (willing to listen and give suggestions / feedbacks) 
• Excellent English written and verbal communication skills.

Updated: 2012-03-19

Contact: Ms. Evelyn


Here is a list of some of the better-known data mining vendors and their solutions. When we were originally looking for partners we leaned toward Predictive Analysis and Monitoring solutions, so this list may be incomplete; a more complete version will follow later.

 

Data mining vendors and their solutions

SAS: SAS Predictive Asset Maintenance (PAM)

IBM: IBM Advanced Analytics solution Pixel fine (Reporting)

Oracle: Oracle Data Mining (ODM)

Applied Materials: E3, Advanced Data mining report

KXEN: KXEN products (K2R, K2S, ...)

Statsoft: STATISTICA Data Miner

Fico: FICO Predictive Analytics

SMI: iTrend

INSTEP: PRiSM

Pdf solutions: Angoss Analytics Software

SAP

Mathworks

Swantech

 

 


To make their research findings feel fresh and striking, researchers often write extremely sensational titles or conclusions to capture attention. Even though this tends to invite criticism in the end, it also builds name recognition.

Although I don't think this trend should be encouraged, such work still deserves more credit than papers published merely to inflate publication counts, or papers whose findings are unremarkable.

 

That said, while searching the Taiwan thesis and dissertation database a few days ago, I came across a thesis whose abstract makes exactly this kind of mistake:

"The results show that the three algorithms produce very similar rules. For predicting rain: when visibility worsens and total cloud cover increases, rain becomes likely. For predicting clear weather: when visibility is good and total cloud cover is low, the weather can be inferred to be clear."
 
After reading it carefully several times, I still cannot work out what the point of the thesis is, or what its major finding is supposed to be.


What is an SVM (Support Vector Machine)?

An SVM is a classification algorithm that can handle both linear and nonlinear data. It transforms the original data into a higher dimension, where it uses the so-called support vectors in the training data set to find a hyperplane that separates the classes. The support vectors are the points in the training data set that provide the most information for the classification; they fall on the dashed lines shown in the figure.

(figure: separating hyperplane and support vectors)

 

The SVM essentially searches for the hyperplane with the largest margin, the maximum marginal hyperplane (MMH), because it yields higher classification accuracy.

 

Usually, a linear separating hyperplane can be written as: WX + b = 0

 

where W = {w1, w2, …, wn} is a weight vector and b is a scalar (the bias).
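Given that definition, classifying a point reduces to checking which side of the hyperplane it falls on, i.e. the sign of WX + b (a sketch with made-up weights, not a trained SVM):

```python
def classify(x, w, b):
    """Assign a class by which side of the hyperplane W.X + b = 0 the point lies on."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

w = [2.0, -1.0]  # hypothetical weight vector W = {w1, w2}
b = -1.0         # hypothetical bias b
print(classify([2.0, 1.0], w, b))  # 2*2 - 1*1 - 1 = 2 -> class +1
print(classify([0.0, 2.0], w, b))  # 0 - 2 - 1 = -3    -> class -1
```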



The core idea of Jaccard similarity is (intersection) / (union): the union counts the items for which either side holds, while the intersection counts the items for which both hold at once.

For example, to compute the similarity between Lee and Ken, and between Meg and Nan:

(figure: item table for Lee, Ken, Meg, and Nan)

Jaccard Similarity (Lee, Ken) = 3/6 = 0.5

Jaccard Similarity (Meg, Nan) = 1/6 = 0.167


Of course, you can also read Jaccard similarity another way. The contingency table below considers the similarity between item i and item j.

(figure: contingency table for items i and j)
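The ratio is easy to compute directly from sets. The item sets below are hypothetical, chosen to reproduce the 3/6 = 0.5 result for Lee and Ken above:

```python
def jaccard(a, b):
    """Jaccard similarity: |intersection| / |union| of two item sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# Hypothetical item sets reproducing the 3/6 = 0.5 example above.
lee = {"i1", "i2", "i3", "i4"}
ken = {"i2", "i3", "i4", "i5", "i6"}
print(jaccard(lee, ken))  # intersection {i2,i3,i4} = 3, union = 6 -> 0.5
```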

 


Item-based collaborative filtering is mainly used to predict user a's rating of item j, and from that prediction to make an appropriate recommendation.

It predicts user a's preference for item j from 1) the similarity between items, and 2) user a's ratings of other items. The formula is:

  P(a,j) = Σi ( Sij × rai ) / Σi |Sij|

where Sij is the similarity between item i and item j, and rai is user a's rating of item i. For example, to predict user a's preference for item 4:

(figure: worked example)

The similarity between items can be measured with the adjusted cosine, whose formula is:

  sim(i,j) = Σu (ru,i − r̄u)(ru,j − r̄u) / ( √(Σu (ru,i − r̄u)²) × √(Σu (ru,j − r̄u)²) )

where ru,i is user u's rating of item i, and r̄u is the average of user u's ratings.

 

For example, the similarity between item i and item j is computed as follows:

(figure: worked example)

 

Item-based CF mainly solves the problem in user-based CF that computation time grows sharply as the number of users increases. It was proposed by Sarwar in 2001.
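The two formulas above (the weighted-sum prediction and the adjusted cosine similarity) can be sketched as follows. The ratings matrix is made up, and with only two users co-rating each item pair the centered similarities necessarily come out as ±1:

```python
import math

def adjusted_cosine(ratings, i, j):
    """Adjusted-cosine similarity of items i and j: each co-rating is
    centered on that user's mean rating before the cosine is taken."""
    users = [u for u in ratings if i in ratings[u] and j in ratings[u]]
    mean = {u: sum(ratings[u].values()) / len(ratings[u]) for u in users}
    num = sum((ratings[u][i] - mean[u]) * (ratings[u][j] - mean[u]) for u in users)
    den_i = math.sqrt(sum((ratings[u][i] - mean[u]) ** 2 for u in users))
    den_j = math.sqrt(sum((ratings[u][j] - mean[u]) ** 2 for u in users))
    return num / (den_i * den_j) if den_i and den_j else 0.0

def predict(ratings, user, target, sims):
    """Weighted-sum prediction: P(a,j) = sum(Sij * rai) / sum(|Sij|)."""
    rated = [i for i in ratings[user] if i != target]
    num = sum(sims[(i, target)] * ratings[user][i] for i in rated)
    den = sum(abs(sims[(i, target)]) for i in rated)
    return num / den if den else 0.0

# Hypothetical ratings matrix (3 users, items A/B/C); u3 has not rated C.
ratings = {
    "u1": {"A": 5, "B": 1, "C": 5},
    "u2": {"A": 2, "B": 4, "C": 2},
    "u3": {"A": 4},
}
sims = {(i, "C"): adjusted_cosine(ratings, i, "C") for i in ("A", "B")}
print(sims[("A", "C")])                             # ~ 1.0: A and C move together
print(round(predict(ratings, "u3", "C", sims), 2))  # driven by u3's rating of A
```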

 

 

