Implementing a text similarity algorithm with the law of cosines (cosine similarity)

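Following on from the algorithm in the previous post, this time I ported it to PHP myself. If you spot anything that could be optimized, please point it out so that we can all learn from it.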
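
As a quick reference, the quantity that the listing computes can be written as follows. The notation is my own assumption and not from the original post: $a_i$ and $b_i$ stand for the number of times character $i$ occurs in the first and second text, and the sums run over every character that appears in either text.

```latex
\[
\cos\theta
  = \frac{\sum_{i} a_i b_i}
         {\sqrt{\sum_{i} a_i^{2}}\;\sqrt{\sum_{i} b_i^{2}}}
\]

% Small worked example: the texts "ab" and "ac" over the characters (a, b, c)
% give the count vectors a = (1, 1, 0) and b = (1, 0, 1).
\[
\cos\theta
  = \frac{1\cdot 1 + 1\cdot 0 + 0\cdot 1}{\sqrt{1^2 + 1^2 + 0^2}\,\sqrt{1^2 + 0^2 + 1^2}}
  = \frac{1}{\sqrt{2}\,\sqrt{2}}
  = \frac{1}{2}
\]
```

Because all counts are non-negative, the result lies between 0 (no characters in common) and 1 (identical character proportions). The measure ignores character order, so two texts that are anagrams of each other score 1.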

<?php
header('Content-Type: text/html; charset=UTF-8');
$text1 = <<<EOF
Fragment A (By Rousseau):
À l’instant, au lieu de la personne particulière de chaque contractant, cet acte d’association produit un Corps moral et collectif, composé d’autant de membres que l’assemblée a de voix, lequel reçoit de ce même acte son unité, son moi commun, sa vie et sa volonté. Cette personne publique, qui se forme ainsi par l’union de toutes les autres, prenait autrefois le nom de Cité , et prend maintenant celui de République ou de Corps politique: lequel est appelé par ses membres État quand il est passif, Souverain quand il est actif, Puissance en le comparant à ses semblables. À l’égard des associés, ils prennent collectivement le nom de peuple, et s’appellent en particulier citoyens, comme participant à l’autorité souveraine, et sujets, comme soumis aux lois de l’État. (Rousseau, Du contrat social, I.6)
EOF;

$text2 = <<<EOF
Fragment B (By Hobbes):
Art goes yet further, imitating that Rationall and most excellent worke of Nature, Man. For by Art is created that great LEVIATHAN called a COMMON-WEALTH, or STATE, (in latine CIVITAS) which is but an Artificiall Man; though of greater stature and strength than the Naturall, for whose protection and defence it was intended; and in which, the Soveraignty is an Artificiall Soul, as giving life and motion to the whole body; The Magistrates, and other Officers of Judicature and Execution, artificiall Joynts; Reward and Punishment (by which fastned to the seat of the Soveraignty, every joynt and member is moved to performe his duty) are the Nerves, that do the same in the Body Naturall; The Wealth and Riches of all the particular members, are the Strength; Salus Populi (the Peoples Safety) its Businesse; Counsellors, by whom all things needfull for it to know, are suggested unto it, are the Memory; Equity and Lawes, an artificiall Reason and Will; Concord, Health; Sedition, Sicknesse; and Civill War, Death. Lastly, the Pacts and Covenants, by which the parts of this Body Politique were at first made, set together, and united, resemble that Fiat, or the Let Us Make Man, pronounced by God in the Creation. (Hobbes, Leviathan, "Introduction")
EOF;


echo similarAlgorithm($text1 , $text2);
/*
* Text similarity matching
* @author juice
* @param text1 first text to compare
* @param text2 second text to compare
* @return double the similarity (cosine value)
*/

function similarAlgorithm($text1 = '', $text2 = '') {
    $charset = 'UTF-8';
    $text1Array = mbStringToArray($text1, $charset, TRUE);
    $text2Array = mbStringToArray($text2, $charset, TRUE);
    // Maps each character to array(count in text 1, count in text 2)
    $textSumArray = array();
    foreach ($text1Array as $val) { // count how often each character occurs in the first text
        if (isset($textSumArray[$val])) {
            $textSumArray[$val][0]++;
        } else {
            $textSumArray[$val] = array(1, 0);
        }
    }
    foreach ($text2Array as $val) { // count how often each character occurs in the second text
        if (isset($textSumArray[$val])) {
            $textSumArray[$val][1]++;
        } else {
            $textSumArray[$val] = array(0, 1);
        }
    }
    $sqdoc1 = 0; // sum of squared counts for the first text
    $sqdoc2 = 0; // sum of squared counts for the second text
    $numerator = 0; // dot product of the two count vectors
    foreach ($textSumArray as $val) {
        $numerator += $val[0] * $val[1];
        $sqdoc1 += $val[0] * $val[0];
        $sqdoc2 += $val[1] * $val[1];
    }
    if ($sqdoc1 == 0 || $sqdoc2 == 0) {
        return 0; // at least one text is empty: return 0 instead of dividing by zero
    }
    return $numerator / sqrt($sqdoc1 * $sqdoc2);
}
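/*
* Splits a string into characters (multibyte-safe, so Chinese text works)
* @author juice
* @param str the string to split
* @param charset the encoding of the string, UTF-8 by default
* @param convertedToHex whether to return each character as hex: TRUE returns the hex form of each character's bytes, FALSE returns the original characters
* @return array the characters of the string, in order
*/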

function mbStringToArray($str = '', $charset = 'UTF-8', $convertedToHex = FALSE) {
    $array = array(); // initialised up front so an empty string yields an empty array
    $strlen = mb_strlen($str, $charset); // pass the charset so the length matches mb_substr
    for ($i = 0; $i < $strlen; $i++) {
        // Take the character at position $i instead of repeatedly re-slicing the string
        $char = mb_substr($str, $i, 1, $charset);
        $array[] = ($convertedToHex === TRUE) ? bin2hex($char) : $char;
    }
    return $array;
}
?>