<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"><HTML><!-- v:10.0.7.13 --><HEAD>
<META content="IE=9.0000" http-equiv="X-UA-Compatible"></HEAD><BODY>
<DIV id="contentArea">
<DIV class="zone" id="mainZone">
<DIV class="compositeModule_1Zone ">
<DIV class="zone">
<DIV class="title deM"><B><H2>Employing Semantic Context for Sparse Information Extraction Assessment</H2></B>
<DIV class="cl"></DIV></DIV>
<DIV class="conM ">
<P> </P>
<P>We address two problems in this paper. First, we want to verify the
correctness of hundreds of millions of isA relationships. That is,
given a candidate pair &lt;c, e&gt;, we want to evaluate how
likely it is that e is an entity of class c. Second, given a candidate pair
&lt;e1, e2&gt; and a known relationship R between classes c1 and
c2, we want to evaluate whether relationship R holds between
e1 and e2.</P>
<H2>Introduction</H2>
<P>The explosive growth and popularity of the World Wide
Web has put a huge amount of text on the
Internet, which presents an unprecedented opportunity for Information Extraction (IE).
IE is at the core of many emerging applications,
such as entity search, text mining,
and risk analysis using financial reports.
In these applications, we can divide the outcome of IE into two categories according to frequency: heads and tails. The heads are the results that occur very
frequently in the corpus. For instance, we can extract the fact that "google is a company" from numerous distinct
sentences. Accepting such results rests on the assumption that the higher the frequency of an extraction, the more likely
it is to be correct. The tails, in contrast, are results that occur very
infrequently. For instance, suppose that from a corpus we extract the
statement that Rhodesia is a country (Rhodesia was an
unrecognised state in southern Africa that existed between
1965 and 1979, following its Unilateral Declaration of Independence
from the United Kingdom on 11 November 1965), and that
its occurrences in the corpus are few and far between. In
Table 1, we show some frequent and rare candidate
countries extracted from a web corpus using Hearst
patterns. All of the frequent entities are correct, while
the majority of the infrequent ones are incorrect. The mistakes come from
either the extraction algorithm or erroneous sentences in
the corpus.</P>
<P align="left"><B>Table 1: Frequent and infrequent candidate entities of country</B></P>
<table align="center">
<TR>
<TD align="center"><B>Frequent Entities</B></TD>
<TD align="center"><B>Rare Entities</B></TD>
</TR>
<TR>
<TD align="center">India</TD>
<TD align="center">Northern</TD>
</TR>
<TR>
<TD align="center">China</TD>
<TD align="center">Sabah</TD>
</TR>
<TR>
<TD align="center">Germany</TD>
<TD align="center">Yap</TD>
</TR>
<TR>
<TD align="center">Australia</TD>
<TD align="center">Parts of sudan</TD>
</TR>
<TR>
<TD align="center">Japan</TD>
<TD align="center">Wealthy</TD>
</TR>
<TR>
<TD align="center">France</TD>
<TD align="center">Western romania</TD>
</TR>
<TR>
<TD align="center">Canada</TD>
<TD align="center">American artists</TD>
</TR>
<TR>
<TD align="center">USA</TD>
<TD align="center">South korea japan</TD>
</TR>
<TR>
<TD align="center">Brazil</TD>
<TD align="center">New sjaelland</TD>
</TR>
<TR>
<TD align="center">Italy</TD>
<TD align="center">Rhodesia</TD>
</TR>
</table>
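<P>As a rough illustration of how candidate isA pairs like those in Table 1 might be harvested, the sketch below matches a single Hearst pattern ("Cs such as E1, E2 and E3") with a regular expression and counts how often each pair occurs. The regular expression, the naive singularization, the function name extract_isa_pairs and the sample sentences are all assumptions made for illustration; this is not the extraction pipeline used to build Probase.</P>
<PRE>
```python
import re
from collections import Counter

# One Hearst pattern: PLURAL-CLASS "such as" ENTITY-LIST.
# Deliberately naive; a real extractor uses many patterns plus parsing.
PATTERN = re.compile(r"(\w+) such as ([^.;]+)")

def extract_isa_pairs(sentences):
    """Count (class, entity) candidates found by the 'such as' pattern."""
    counts = Counter()
    for sentence in sentences:
        for match in PATTERN.finditer(sentence.lower()):
            plural, entity_list = match.groups()
            # Naive singularization: "countries" -> "country", "sports" -> "sport".
            if plural.endswith("ies"):
                concept = plural[:-3] + "y"
            else:
                concept = plural.rstrip("s")
            # Split the entity list on commas and on the word "and".
            for entity in re.split(r",|\band\b", entity_list):
                entity = entity.strip()
                if entity:
                    counts[(concept, entity)] += 1
    return counts

# Made-up sentences, for illustration only.
sentences = [
    "The tour covers countries such as India, China and Germany.",
    "He visited countries such as India and Rhodesia.",
]
pairs = extract_isa_pairs(sentences)
print(pairs[("country", "india")])     # seen in both sentences
print(pairs[("country", "rhodesia")])  # seen in one sentence only
```
</PRE>
<P>Because the pattern simply takes everything up to the next period as the entity list, trailing words can leak into a candidate entity, which is one way noisy rare candidates can arise.</P>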
<P>How to verify the correctness of a tail extraction
(also known as a sparse extraction) is one of the most
important and challenging problems in IE. The distribution of words and phrases in a corpus of
natural language utterances follows Zipf's law, which states that the frequency of any word or phrase is
inversely proportional to its rank in the frequency table.
Thus, most words and phrases occur only rarely in any particular syntactic pattern
used for extraction. Without a good
mechanism to distinguish correct extractions from incorrect
ones, sparse information extraction will be plagued by
either low precision or low recall.</P>
<P>Existing efforts in information extraction or sparse extraction
can be divided into the following four classes. Heuristic-based approaches start with a set of seed
entities for a given relation, or with some prior knowledge of the label distribution,
and iteratively identify extraction patterns for
the relation. Redundancy-based approaches require that extractions
appear relatively frequently with a limited set of patterns. Knowledge-based approaches rely on
external resources, such as Wikipedia, Freebase and WordNet, to identify extractions. Finally, most popular approaches to handling
sparse extractions are context-based model-building
approaches. They rely on an important hypothesis known
as the distributional hypothesis, which says that different
entities of the same semantic relation (unary or
binary) tend to appear in similar
textual contexts. For example, we may not find many
occurrences of Rhodesia in the Hearst pattern "countries
such as Rhodesia". But if Rhodesia appears in contexts
similar to those where terms such as India, USA, and Germany
occur, then, according to the distributional
hypothesis, we can be more certain about the claim
that Rhodesia is a country. This hypothesis is helpful for assessing sparse
extractions. However, the challenge lies in modeling contexts and
measuring the semantic similarity of two contexts.</P>
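<P>The distributional hypothesis can be made concrete with a small sketch: collect the words surrounding each mention of an entity, then compare a candidate's context with the contexts of known seeds. The version below uses a plain bag-of-words context and cosine similarity, which is only a simple baseline and not the semantic context representation proposed in this work. The window size, the function names and the tiny corpus are assumptions made for illustration.</P>
<PRE>
```python
import math
from collections import Counter

def context_vector(entity, sentences, window=3):
    """Bag of words found within `window` tokens of each mention of `entity`."""
    vector = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, token in enumerate(tokens):
            if token == entity:
                left = tokens[max(0, i - window):i]
                right = tokens[i + 1:i + 1 + window]
                vector.update(left + right)
    return vector

def cosine(u, v):
    """Cosine similarity of two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Made-up corpus, for illustration only.
corpus = [
    "the government of india signed the treaty",
    "the government of germany signed the accord",
    "the government of rhodesia declared independence",
    "she ordered a wealthy serving of dessert",
]
seeds = ["india", "germany"]

def seed_similarity(candidate):
    """Average similarity between the candidate's context and each seed's context."""
    cand = context_vector(candidate, corpus)
    sims = [cosine(cand, context_vector(s, corpus)) for s in seeds]
    return sum(sims) / len(sims)

print(seed_similarity("rhodesia"))  # shares "the government of" with the seeds
print(seed_similarity("wealthy"))   # shares almost nothing with the seeds
```
</PRE>
<P>In this toy corpus the rare candidate rhodesia scores higher than wealthy because it shares governmental context words with the seeds, even though it never appears in a Hearst pattern.</P>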
<H2>Our Semantic Context based Approach for Assessing Sparse Information Extractions</H2>
<P>We now analyze the challenges in these tasks. The first
challenge is scale. For example, there are hundreds of
millions of isA relationships (formed among 2.7 million
categories and 5.5 million entities in Probase [1][2]). It is
impractical to learn a generative model (such as an
HMM or a deep learning model) based on
the contexts of all entities, because doing so is very time-consuming. The
second challenge lies in improving the effectiveness of
the verifier. The feature representation
based on the contexts of words is very different from that
based on the contexts of entities. Meanwhile, neither a
bag of words nor a set of hidden states provides
semantics good enough to understand the relationship between
a candidate pair.
Motivated by this, in this paper we introduce a
semantic, scalable, and effective approach for sparse
information extraction assessment.</P>
<H3>The main contributions of this paper are as follows.</H3>
<P>First, we introduce a semantic approach for solving the two problems. More precisely, we
come up with a semantic representation of the contexts. This approach is natural because we
are dealing with a large semantic network, which provides semantic information in various
aspects. Using this information, we are able to introduce semantic features to describe a
context, which leads to a lightweight and effective solution to context learning.</P>
<P>Second, we scan billions of web documents using MapReduce to capture the contexts of
millions of entities and pairs of entities in Probase, and then compare
their contexts with the contexts of seeds. We further use the similarities evaluated by our
three semantic context based approaches as the feature space of a given pair, and
then train a binary classifier on a small amount of labeled data, trying different
base classifiers and selecting the best one for predicting sparse extractions. Extensive studies
show that our approach achieves better performance than state-of-the-art approaches in
sparse extraction assessment.</P>
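<P>The second contribution can be pictured as follows: each candidate pair becomes a short feature vector of seed-similarity scores, one per context approach, and a binary classifier is trained on a few labeled pairs. The sketch below assumes three scores per pair and uses a small hand-written logistic regression as a stand-in for the base classifiers actually used in this work. All feature values, labels and function names are invented for illustration.</P>
<PRE>
```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(samples, labels, epochs=2000, lr=0.5):
    """Fit weights and a bias by plain stochastic gradient descent on log loss."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            error = p - y
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias

def predict(model, x):
    """Return 1 (good pair) when the estimated probability is at least 0.5."""
    weights, bias = model
    p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
    return 1 if p >= 0.5 else 0

# Each row: hypothetical similarity scores of one candidate pair to the seeds,
# one score per context approach (three approaches assumed).
train_x = [
    [0.82, 0.75, 0.90],  # labeled good
    [0.70, 0.66, 0.81],  # labeled good
    [0.12, 0.20, 0.05],  # labeled bad
    [0.25, 0.10, 0.15],  # labeled bad
]
train_y = [1, 1, 0, 0]

model = train_logistic(train_x, train_y)
print(predict(model, [0.78, 0.69, 0.85]))  # close to the seeds on all three scores
print(predict(model, [0.08, 0.15, 0.11]))  # far from the seeds on all three scores
```
</PRE>
<P>Swapping in a different base classifier only changes the training and prediction functions; the three-score feature representation stays the same.</P>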
<H2>Data Sets </H2>
<P>For the experimental data sets, we randomly
selected about 1800 entities that belong to 12 classes in
Probase. Tables 2 and 3 show the descriptions and some
examples of each class, respectively. Each entity has no
more than 10 occurrences in Hearst patterns, and we call
such entities sparse extractions. We chose this threshold because more than 90% of the
entities of the above 12 concepts have no more than 10
occurrences in Probase, that is, they lie in the long tail
of the entity distribution curves. For example, Figure
2 shows the frequency distribution of the
entities in country. We can clearly see the long tail
phenomenon under the dotted line, with no more than
10 occurrences. We asked human judges to evaluate the
correctness of the selected entities. We also looked into three binary relations: isCapitalOf, isCurrencyOf, and headquarteredIn. We randomly
picked 315 sparse extractions that have no more than
10 occurrences, and we also picked the 10 most frequent
extractions for each relation to serve as seeds. Details of all test relationships are shown in Table 2.</P>
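<P>The selection rule described above (sparse extractions have no more than 10 occurrences, and the 10 most frequent extractions serve as seeds) can be sketched as a simple split over occurrence counts. The function name and the occurrence counts below are hypothetical; only the two thresholds come from the description above.</P>
<PRE>
```python
SPARSE_MAX_OCCURRENCES = 10  # "no more than 10 occurrences"
NUM_SEEDS = 10               # the 10 most frequent extractions serve as seeds

def split_seeds_and_sparse(frequencies):
    """Return (seeds, sparse) from a mapping of extraction -> occurrence count."""
    ranked = sorted(frequencies, key=frequencies.get, reverse=True)
    seeds = ranked[:NUM_SEEDS]
    sparse = [e for e in ranked if SPARSE_MAX_OCCURRENCES >= frequencies[e]]
    return seeds, sparse

# Made-up occurrence counts, for illustration only.
frequencies = {
    "india": 9500, "china": 9100, "germany": 8700, "australia": 8200,
    "japan": 8000, "france": 7900, "canada": 7500, "usa": 7400,
    "brazil": 7000, "italy": 6800, "rhodesia": 3, "yap": 2, "wealthy": 1,
}
seeds, sparse = split_seeds_and_sparse(frequencies)
print(seeds)
print(sparse)
```
</PRE>
<P>Sparse candidates selected this way still need human labels, since a low count says nothing about whether the extraction is correct.</P>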
<P align="center"><B>Table 2: Data sets used in experiments</B></P>
<TABLE align="center" class=" borderColumns borderRows tableBorder" cellSpacing="0" cellPadding="0">
<TBODY>
<TR>
<TD align="center"><B></B></TD>
<TD align="center"><B>total pairs in Probase</B></TD>
<TD align="center"><B>pairs with frequency &lt; 10 </B></TD>
<TD align="center"><B>pairs in experiments</B></TD>
<TD align="center"><B>#bad pairs</B></TD>
<TD align="center"><B>#good pairs</B></TD>
</TR>
<TR>
<TD align="center" colspan="6">isA relationships</TD>
</TR>
<TR>
<TD align="center"><B>country</B></TD>
<TD align="center"><B>5534</B></TD>
<TD align="center"><B>92.81% </B></TD>
<TD align="center"><B>415</B></TD>
<TD align="center"><B>226</B></TD>
<TD align="center"><B>189</B></TD>
</TR>
<TR>
<TD align="center"><B>sport</B></TD>
<TD align="center"><B>2866</B></TD>
<TD align="center"><B>92.18% </B></TD>
<TD align="center"><B>335</B></TD>
<TD align="center"><B>67</B></TD>
<TD align="center"><B>268</B></TD>
</TR>
<TR>
<TD align="center"><B>city</B></TD>
<TD align="center"><B>8815</B></TD>
<TD align="center"><B>90.05% </B></TD>
<TD align="center"><B>231</B></TD>
<TD align="center"><B>33</B></TD>
<TD align="center"><B>198</B></TD>
</TR>
<TR>
<TD align="center"><B>animal</B></TD>
<TD align="center"><B>5562</B></TD>
<TD align="center"><B>92.38% </B></TD>
<TD align="center"><B>186</B></TD>
<TD align="center"><B>37</B></TD>
<TD align="center"><B>149</B></TD>
</TR>
<TR>
<TD align="center"><B>seasoning</B></TD>
<TD align="center"><B>531</B></TD>
<TD align="center"><B>92.47% </B></TD>
<TD align="center"><B>169</B></TD>
<TD align="center"><B>41</B></TD>
<TD align="center"><B>128</B></TD>
</TR>
<TR>
<TD align="center"><B>company</B></TD>
<TD align="center"><B>59734</B></TD>
<TD align="center"><B>96.84%</B></TD>
<TD align="center"><B>82</B></TD>
<TD align="center"><B>9</B></TD>
<TD align="center"><B>73</B></TD>
</TR>
<TR>
<TD align="center"><B>painter</B></TD>
<TD align="center"><B>1097</B></TD>
<TD align="center"><B>98.09% </B></TD>
<TD align="center"><B>81</B></TD>
<TD align="center"><B>5</B></TD>
<TD align="center"><B>76</B></TD>
</TR>
<TR>
<TD align="center"><B>currency</B></TD>
<TD align="center"><B>330</B></TD>
<TD align="center"><B>91.82%</B></TD>
<TD align="center"><B>78</B></TD>
<TD align="center"><B>8</B></TD>
<TD align="center"><B>70</B></TD>
</TR>
<TR>
<TD align="center"><B>disease</B></TD>
<TD align="center"><B>8280</B></TD>
<TD align="center"><B>92.60%</B></TD>
<TD align="center"><B>69</B></TD>
<TD align="center"><B>9</B></TD>
<TD align="center"><B>60</B></TD>
</TR>
<TR>
<TD align="center"><B>film</B></TD>
<TD align="center"><B>10859</B></TD>
<TD align="center"><B>96.62%</B></TD>
<TD align="center"><B>65</B></TD>
<TD align="center"><B>25</B></TD>
<TD align="center"><B>40</B></TD>
</TR>
<TR>
<TD align="center"><B>language</B></TD>
<TD align="center"><B>2703</B></TD>
<TD align="center"><B>93.53% </B></TD>
<TD align="center"><B>51</B></TD>
<TD align="center"><B>6</B></TD>
<TD align="center"><B>45</B></TD>
</TR>
<TR>
<TD align="center"><B>river</B></TD>
<TD align="center"><B>1924</B></TD>
<TD align="center"><B>97.77%</B></TD>
<TD align="center"><B>40</B></TD>
<TD align="center"><B>2</B></TD>
<TD align="center"><B>38</B></TD>
</TR>
<TR>
<TD align="center"><B>total</B></TD>
<TD align="center"><B>108235</B></TD>
<TD align="center"><B>92.25%</B></TD>
<TD align="center"><B>1802</B></TD>
<TD align="center"><B>468</B></TD>
<TD align="center"><B>1334</B></TD>
</TR>
<TR>
<TD align="center" colspan="6">Binary relationships</TD>
</TR>
<TR>
<TD align="center" colspan="3">isCapitalOf(country, city)</TD>
<TD align="center"><B>160</B></TD>
<TD align="center"><B>39</B></TD>
<TD align="center"><B>121</B></TD>
</TR>
<TR>
<TD align="center" colspan="3">isCurrencyOf(country, currency)</TD>
<TD align="center"><B>80</B></TD>
<TD align="center"><B>19</B></TD>
<TD align="center"><B>61</B></TD>
</TR>
<TR>
<TD align="center" colspan="3">headquarteredIn(company, city)</TD>
<TD align="center"><B>75</B></TD>
<TD align="center"><B>22</B></TD>
<TD align="center"><B>53</B></TD>
</TR>
<TR>
<TD align="center" colspan="3">total</TD>
<TD align="center"><B>315</B></TD>
<TD align="center"><B>80</B></TD>
<TD align="center"><B>235</B></TD>
</TR>
</TBODY></TABLE>
<P align="center"><B>Table 3: Examples of isA and binary relations</B></P>
<P>
<TABLE align="center" class=" borderColumns borderRows tableBorder" cellSpacing="0" cellPadding="0">
<TR>
<TD align="center"><B>Relation</B></TD>
<TD align="center"><B>Bad pair</B></TD>
<TD align="center"><B>Good pair</B></TD>
</TR>
<TR>
<TD align="center"><B>country</B></TD>
<TD align="center"><B>&lt;country, democratic people&gt;</B></TD>
<TD align="center"><B>&lt;country, g77&gt;</B></TD>
</TR>
<TR>
<TD align="center"><B>city</B></TD>
<TD align="center"><B>&lt;city, santa martha&gt;</B></TD>
<TD align="center"><B>&lt;city, amadora&gt;</B></TD>
</TR>
<TR>
<TD align="center"><B>sport</B></TD>
<TD align="center"><B>&lt;sport, trafalgar park&gt;</B></TD>
<TD align="center"><B>&lt;sport, girls golf&gt;</B></TD>
</TR>
<TR>
<TD align="center"><B>animal</B></TD>
<TD align="center"><B>&lt;animal, cauquenes&gt;</B></TD>
<TD align="center"><B>&lt;animal, moon snail&gt;</B></TD>
</TR>
<TR>
<TD align="center"><B>seasoning</B></TD>
<TD align="center"><B>&lt;seasoning, bacon bit&gt;</B></TD>
<TD align="center"><B>&lt;seasoning, five spice&gt;</B></TD>
</TR>
<TR>
<TD align="center"><B>company</B></TD>
<TD align="center"><B>&lt;company, institute&gt;</B></TD>
<TD align="center"><B>&lt;company, hasbro&gt;</B></TD>
</TR>
<TR>
<TD align="center"><B>painter</B></TD>
<TD align="center"><B>&lt;painter, robert young&gt;</B></TD>
<TD align="center"><B>&lt;painter, childe hassam&gt;</B></TD>
</TR>
<TR>
<TD align="center"><B>film</B></TD>
<TD align="center"><B>&lt;film, forest gump&gt;</B></TD>
<TD align="center"><B>&lt;film, breach&gt;</B></TD>
</TR>
<TR>
<TD align="center"><B>language</B></TD>
<TD align="center"><B>&lt;language, francophone&gt;</B></TD>
<TD align="center"><B>&lt;language, micmac&gt;</B></TD>
</TR>
<TR>
<TD align="center"><B>river</B></TD>
<TD align="center"><B>&lt;river, manda&gt;</B></TD>
<TD align="center"><B>&lt;river, missouri river&gt;</B></TD>
</TR>
<TR>
<TD align="center"><B>isCapitalOf(country, city)</B></TD>
<TD align="center"><B>&lt;dili, east timor&gt;</B></TD>
<TD align="center"><B>&lt;andorra, andorra la Vella&gt;</B></TD>
</TR>
<TR>
<TD align="center"><B>isCurrencyOf(country, currency)</B></TD>
<TD align="center"><B>&lt;baht, thailand&gt;</B></TD>
<TD align="center"><B>&lt;colombia, colombian peso&gt;</B></TD>
</TR>
<TR>
<TD align="center"><B>headquarteredIn(company, city)</B></TD>
<TD align="center"><B>&lt;espoo, general electric&gt;</B></TD>
<TD align="center"><B>&lt;michelin, clermont-ferrand&gt;</B></TD>
</TR>
</TABLE>
</P>
<P></P>
<DIV style="clear: both;"></DIV>
<DIV class="conM ">
<H2>Used Data Sets: Download</H2>
<P>For more details, refer to <A onclick="stc(this, 26)" href="https://github.com/peipeilihfut/AssessSparseIE/blob/master/AllDataSets.pdf"
target="_new"> Used Data Sets (1) </A>.</P>
<P>For more details, refer to <A onclick="stc(this, 26)" href="https://github.com/peipeilihfut/AssessSparseIE/blob/master/New-labeled-data-sets.pdf" target="_new"> Used Data Sets (new) </A>.</P></DIV>
<DIV style="clear: both;"></DIV>
<DIV class="conM ">
<H2>Source codes: Download</H2>
<P>Our project is implemented in C# and SQL Server. The base classifiers used in our approach are from <A onclick="stc(this, 26)" href="http://www.cs.waikato.ac.nz/ml/weka/downloading.html" target="_new"> Weka-3.8.1.jar</A>. The source code of this project is available at <A onclick="stc(this, 26)" href="https://github.com/peipeilihfut/AssessSparseIE/blob/master/AssessSparseIEProject.rar"
target="_new"> Source codes</A>.</P>
</DIV>
<P>
Our AM (attribute-based context), CM (concept-based context) and IM (isA-based context) approaches have similar parameters, so we explain the parameter list of CM as an example. The main functions of these three approaches are AMMain(string[] args), SuperConceptBasedMain(string[] args) and IMBasedMain(string[] args), respectively, in the file "CleaningMain.cs".</P>
<P align="left"><B> Parameter list for our CM approach</B></P>
<TABLE align="center" class=" borderColumns borderRows tableBorder" cellSpacing="0" cellPadding="0">
<TBODY>
<TR>
<TD align="center"><B>Variable</B></TD>
<TD align="left"><B>Description</B></TD>
</TR>
<TR>
<TD align="center"><B>databaseServer</B></TD>
<TD align="left"><B>the name of the database server;</B></TD>
</TR>
<TR>
<TD align="center"><B>databaseName</B></TD>
<TD align="left"><B>the name of database;</B></TD>
</TR>
<TR>
<TD align="center"><B>testEntityTable</B></TD>
<TD align="left"><B>the table of conceptualization;</B></TD>
</TR>
<TR>
<TD align="center"><B>isSelectedTopK</B></TD>
<TD align="left"><B>whether to select the top-K tokens, 1: yes, 0: no;</B></TD>
</TR>
<TR>
<TD align="center"><B>classNumThres</B></TD>
<TD align="left"><B>the maximum number of concepts in conceptualization;</B></TD>
</TR>
<TR>
<TD align="center"><B>distEvalType</B></TD>
<TD align="left"><B>the type of distance evaluation;</B></TD>
</TR>
<TR>
<TD align="center"><B>seedsNum</B></TD>
<TD align="left"><B>the number of seeds;</B></TD>
</TR>
<TR>
<TD align="center"><B>bUseClustering</B></TD>
<TD align="left"><B>whether to use clustering, default: false;</B></TD>
</TR>
<TR>
<TD align="center"><B>pathStr</B></TD>
<TD align="left"><B>directory of files;</B></TD>
</TR>
</TBODY>
</TABLE>
</P>
<DIV style="clear: both;"></DIV>
<DIV class="conM ">
<H2>References</H2>
<P>[1] Knowledgebase Probase: http://research.microsoft.com/en-us/projects/probase/release.aspx</P>
<P>[2] W. Wu, H. Li, H. Wang, and K. Q. Zhu. Probase: a probabilistic taxonomy
for text understanding. In Proceedings of SIGMOD'12, pages 481-492, 2012.</P>
<H2>Please cite the following references if you use this source code</H2>
<P>[1] Peipei Li, Haixun Wang, Hongsong Li, and Xindong Wu. Employing Semantic Context for Sparse Information Extraction Assessment. ACM Transactions on Knowledge Discovery from Data, 12(5): 54:1-36, July 2018.</P>
<P>[2] Peipei Li, Haixun Wang, Hongsong Li, and Xindong Wu. Assessing Sparse Information Extraction using Semantic Contexts. In Proceedings of the 22nd ACM International Conference on Information and Knowledge Management (CIKM’13), pages 1709-1714, San Francisco, CA, USA, October 28 - November 1, 2013.</P>
</DIV>
<DIV class="conM ">
<H2>Contact</H2>
<P><A title="" style="zoom: 1;" onclick="stc(this, 30)" href="http://ci.hfut.edu.cn/index/teacherinfo/tid/522"
target="_new" alt="">Peipei Li</A> (peipeili@hfut.edu.cn): Hefei University of Technology, China<BR>
<A title="" style="zoom: 1;" onclick="stc(this, 30)" href="http://haixun.olidu.com/"
target="_new" alt="">Haixun Wang</A> (haixun@google.com): Google Research, USA <BR><A title="" style="zoom: 1;" onclick="stc(this, 29)"
target="_new" alt="">Hongsong Li</A> (hongsong.lhs@alibaba-inc.com): Alibaba Group, China <BR><A title="" style="zoom: 1;" onclick="stc(this, 30)" href="http://www.ucs.louisiana.edu/~xxw8007/"
target="_new" alt="">Xindong Wu</A> (xwu@uvm.edu): University of Louisiana at Lafayette, USA</P>
<P class="smallText"></P>
<P class="smallText"> </P></DIV>
<DIV style="clear: both;"></DIV></DIV>
<DIV style="clear: left;"></DIV></DIV></DIV></DIV><!--NOINDEX_START-->
<DIV class="cl"></DIV>
<DIV class="bt" id="bGrad"></DIV></DIV></DIV>
</DIV>
</DIV></BODY></HTML>