USE CASE

Title: Benefits from Extending KBpedia with Private Datasets

Short Description: Significant improvements in tagging accuracy may be obtained by adding private (enterprise- or domain-specific) datasets to the standard public knowledge bases already in KBpedia.

Problem: We want to obtain as comprehensive and accurate tagging of entities as possible for our specific enterprise needs.

Approach: KBpedia provides a rich set of 30 million entities in its standard configuration. However, by identifying and including relevant entity lists already in the possession of the enterprise, or from specialty datasets in the relevant domain, significant improvements can be achieved across all of the standard metrics used for entity recognition and tagging. Further, our standard methodology includes the creation of reference, or "gold", standards for measuring the benefits of adding more data or performing other tweaks on the entity extraction algorithms.

Key Findings:
This use case demonstrates two aspects of working with the KBpedia knowledge structure. First, we demonstrate the benefits of adding private datasets to the standard knowledge bases included with KBpedia. And, second, we highlight our standard use of reference, or "gold", standards as a way of measuring progress in tweaking datasets and parameters when doing machine learning tasks.
The basis for this use case is an enterprise that is monitoring information published on the Web and wants to be able to identify the organization responsible for publishing a given page. The enterprise monitors the Web on a daily basis in its domain of interest and is able to identify new Web pages it has not seen before. Further, the enterprise also has two of its own datasets that contain possible candidate organizations that might be publishers of such pages. These two datasets are private and not available to the general public.
In this use case, we describe a publisher analyzer used for organization identification, the standard KBpedia datasets available for the task, the enterprise's own private datasets, and the approach we take to "gold standards" and the specifics of that standard for this case. Once these component parts are described, we proceed to give the results of adding or using different datasets. We then summarize some conclusions.
Note that this use case is broadly applicable to any entity recognition and tagging initiative.
The analysis framework is comprised of general platform code, the publisher analyzer, the standard KBpedia knowledge structure and its public knowledge bases, reference "gold standards", and, for the test, external (private enterprise) data.
The publisher analyzer attempts to determine the publisher of a web page by analyzing the web page's content. There are multiple moving parts to this analyzer, but its general internal workflow is as follows.
The machine learning system leverages existing algorithms to calculate the likelihood that an organization is the publisher of a web page and to detect unknown organizations. These are conventional uses of these algorithms. What differentiates the publisher analyzer is its knowledge base. We leverage KBpedia to detect known organization entities, we use the knowledge in KBpedia's combined KBs for each of these entities to improve the analysis process, and we constrain the analysis to certain types of named entities (determined by inference), among other steps. The special sauce of this entire process is the fully integrated set of knowledge bases that comprise KBpedia, including its hundreds of thousands of concepts, 39,000 reference concepts, and 20 million known entities.
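As a rough illustration of the matching step only (not the analyzer's actual machine learning code or KBpedia API), the sketch below scores candidates simply by how often a known organization name appears in the page text. The `detect-publisher` function, the toy `known-orgs` map, and the example URI are our own inventions for this sketch.

```clojure
;; Drastically simplified sketch of matching page content against known
;; organization entities. The real analyzer uses machine learning scoring and
;; KBpedia type inference; here a plain frequency count over a toy name->URI
;; map stands in for both.
(defn detect-publisher
  "Return the URI of the known organization mentioned most often in the
   page text, or :unknown when no known organization is found."
  [page-text known-orgs]                                   ; known-orgs: {name -> uri}
  (let [counts (into {}
                     (for [[org-name uri] known-orgs
                           :let [hits (count (re-seq (re-pattern (java.util.regex.Pattern/quote org-name))
                                                     page-text))]
                           :when (pos? hits)]
                       [uri hits]))]
    (if (seq counts)
      (key (apply max-key val counts))                     ; best-supported known entity
      :unknown)))                                          ; candidate for unknown-entity handling

;; (detect-publisher "© 2016 Acme Corp. All rights reserved."
;;                   {"Acme Corp" "http://dbpedia.org/resource/Acme_Corp"})
;; => "http://dbpedia.org/resource/Acme_Corp"
```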
These public datasets are then compared to two private datasets, which contain high-quality, curated, and domain-related listings of organizations. The numbers of organizations contained in these private datasets are much smaller than those in the public ones, but they are also more relevant to the domain. These private datasets are fairly typical of the specific information that an enterprise may have available in its own domain.
The reference standard, or "gold standard", employed in this use case is composed of 511 randomly selected Web pages that are manually vetted and characterized. (As a general rule of thumb we find about 500 examples in the positive training set to be adequate.)
The gold standard itself is quite simple. For each URL in the standard, we manually determine the publishing organization. Once the organization is determined, we search each dataset to see whether the entity already exists. If it does, we add the URI (unique identifier) of that entity in the knowledge base to the gold standard. It is this URI reference that is used to determine whether the publisher analyzer properly detects the actual publisher of the web page.
We also manually add a set of 10 web pages for which we are sure that no publisher can be determined. These are the 10 True Negative (see below) instances of the gold standard.
The gold standard also includes the identifier of possible unknown entities that are the publishers of the web pages. These are used to calculate the metrics when considering the unknown entities detected by the system.
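To make this concrete, a gold-standard record might be represented along the following lines. The field names are illustrative assumptions on our part, not the analyzer's actual file format; unknown-entity publishers would carry their own identifiers in the same way.

```clojure
;; Illustrative sketch of gold-standard records (field names are assumptions,
;; not the actual format used by the analyzer).
(def gold-standard
  [{:url           "http://example.com/known-publisher-page.html"
    :publisher-uri "http://dbpedia.org/resource/Example_Organization"}  ; known entity
   {:url           "http://example.com/no-publisher-page.html"
    :publisher-uri nil}])                                               ; a True Negative case
```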
The goal of the analysis is to determine how well the analyzer performs the task of detecting the organization that published a given web page. To do so, we use a set of metrics that helps us understand the performance of the system. The metrics calculation is based on the confusion matrix.
When we are processing a new run, all results are characterized according to four possible scoring values:
True Positive (TP): test identifies the same entity as in the gold standard
False Positive (FP): test identifies a different entity than what is in the gold standard
True Negative (TN): test identifies no entity; gold standard has no entity
False Negative (FN): test identifies no entity, but the gold standard has one
The True Positive, False Positive, True Negative, and False Negative counts (see Type I and type II errors for definitions) should be interpreted in the manner common to named entity recognition tasks.
This simple scoring method allows us to apply a series of metrics based on the four possible scoring values:
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
Accuracy = (TP + TN) / (TP + TN + FP + FN)
The F-score is a common combined score of the general prediction system: it combines precision and recall via their harmonic mean. More generally, Fβ = (1 + β²) × (Precision × Recall) / (β² × Precision + Recall), with β = 1, 2, and 0.5 for the three measures reported here. The f2 measure weighs recall higher than precision (by placing more emphasis on false negatives), and the f0.5 measure weighs recall lower than precision (by attenuating the influence of false negatives). Cognonto includes all three F-measures in its standard reports to give a general overview of what happens when we put an emphasis on precision or recall. Some clients prefer limiting false positives at the cost of lower recall; others want fuller coverage.
Still, in most instances, we have found that customers consider accuracy to be the most useful metric. We particularly emphasize that metric in the results below.
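As a cross-check on the numbers reported in the runs below, here is a minimal Clojure sketch of these metric calculations using the standard textbook formulas. The function is ours and is not the platform's `generate-stats` implementation.

```clojure
;; Minimal sketch of the confusion-matrix metrics used throughout this use case.
;; Standard textbook formulas only; not the platform's generate-stats internals.
(defn metrics
  "Compute the standard metrics from raw confusion-matrix counts."
  [tp fp tn fneg]
  (let [precision (/ tp (+ tp fp))
        recall    (/ tp (+ tp fneg))
        accuracy  (/ (+ tp tn) (+ tp tn fp fneg))
        f-beta    (fn [beta]
                    (let [b2 (* beta beta)]
                      (/ (* (inc b2) precision recall)
                         (+ (* b2 precision) recall))))]
    {:precision (double precision)
     :recall    (double recall)
     :accuracy  (double accuracy)
     :f1        (double (f-beta 1))
     :f2        (double (f-beta 2))
     :f0.5      (double (f-beta 0.5))}))

;; Example, using the counts of the DBpedia-only run reported below:
;; (metrics 121 57 19 314)
;; => {:precision 0.6797..., :recall 0.2781..., :accuracy 0.2739..., :f1 0.3947..., ...}
```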
The goal with these tests is to run the gold standard calculation procedure against various combinations of the available datasets in order to determine their comparative contribution to improved accuracy (or other metric of choice). Here is the general run procedure; note that each run has a standard presentation of the run statistics, beginning with the four scoring values, followed by the standard metrics:
The first step is to create the starting basis, which includes no dataset. We then add different datasets, and try different combinations, computing against the gold standard each time so that we know the impact of each on the metrics.
(table (generate-stats :js :execute :datasets []))
True positives: 2, False positives: 5, True negatives: 19, False negatives: 485

key | value |
---|---|
:precision | 0.2857143 |
:recall | 0.0041067763 |
:accuracy | 0.04109589 |
:f1 | 0.008097166 |
:f2 | 0.0051150895 |
:f0.5 | 0.019417476 |
Now, let's see the impact of each of the datasets that exist in the knowledge base we created to perform these tests. This will give us an indication of the inherent impact of each dataset on the prediction task.
Let's test the impact of adding a single general-purpose dataset, the publicly available Wikipedia (via DBpedia):
(table (generate-stats :js :execute :datasets ["http://dbpedia.org/resource/"]))
True positives: 121, False positives: 57, True negatives: 19, False negatives: 314

key | value |
---|---|
:precision | 0.6797753 |
:recall | 0.27816093 |
:accuracy | 0.2739726 |
:f1 | 0.39477977 |
:f2 | 0.31543276 |
:f0.5 | 0.52746296 |
Now, let's test the impact of adding another single general-purpose dataset, this one the publicly available Freebase:
(table (generate-stats :js :execute :datasets ["http://rdf.freebase.com/ns/"]))
True positives: 11, False positives: 14, True negatives: 19, False negatives: 467

key | value |
---|---|
:precision | 0.44 |
:recall | 0.023012552 |
:accuracy | 0.058708414 |
:f1 | 0.043737575 |
:f2 | 0.028394425 |
:f0.5 | 0.09515571 |
Now, let's test the impact of adding a different, more specialized publicly available dataset, the USPTO:
(table (generate-stats :js :execute :datasets ["http://www.uspto.gov"]))
True positives: 6, False positives: 13, True negatives: 19, False negatives: 473

key | value |
---|---|
:precision | 0.31578946 |
:recall | 0.012526096 |
:accuracy | 0.04892368 |
:f1 | 0.024096385 |
:f2 | 0.015503876 |
:f0.5 | 0.054054055 |
Now, let's test the first private dataset:
(table (generate-stats :js :execute :datasets ["http://kbpedia.org/datasets/private/1/"]))
True positives: 231, False positives: 109, True negatives: 19, False negatives: 152

key | value |
---|---|
:precision | 0.67941177 |
:recall | 0.60313314 |
:accuracy | 0.4892368 |
:f1 | 0.6390042 |
:f2 | 0.61698717 |
:f0.5 | 0.6626506 |
And, then, the second private dataset:
(table (generate-stats :js :execute :datasets ["http://kbpedia.org/datasets/private/2/"]))
True positives: 24, False positives: 21, True negatives: 19, False negatives: 447

key | value |
---|---|
:precision | 0.53333336 |
:recall | 0.050955415 |
:accuracy | 0.08414873 |
:f1 | 0.093023255 |
:f2 | 0.0622084 |
:f0.5 | 0.1843318 |
A more realistic analysis is to use a combination of datasets. Let's see what happens to the performance metrics if we start combining the public datasets only.
First, let's start by combining Wikipedia and Freebase.
(table (generate-stats :js :execute :datasets ["http://dbpedia.org/resource/" "http://rdf.freebase.com/ns/"]))
True positives: 126, False positives: 60, True negatives: 19, False negatives: 306

key | value |
---|---|
:precision | 0.67741936 |
:recall | 0.29166666 |
:accuracy | 0.28375733 |
:f1 | 0.407767 |
:f2 | 0.3291536 |
:f0.5 | 0.53571427 |
Adding the Freebase dataset to the DBpedia one has the following effects on the different metrics:
metric | Impact in % |
---|---|
precision | -0.35% |
recall | +4.85% |
accuracy | +3.57% |
f1 | +3.29% |
f2 | +4.34% |
f0.5 | +1.57% |
As we can see, the impact of adding Freebase to the knowledge base is positive, even if not groundbreaking considering the size of the dataset.
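For reference, the "Impact in %" figures in these tables are relative changes between two runs. A small helper along the following lines (ours, not part of the platform) reproduces them from the metric maps:

```clojure
;; Relative change between two runs' metric maps, expressed in percent.
;; This mirrors how the "Impact in %" tables in this use case are derived.
(defn percent-impact
  [baseline-metrics new-metrics]
  (into {}
        (for [[k base] baseline-metrics]
          [k (* 100.0 (/ (- (new-metrics k) base) base))])))

;; Example: recall goes from 0.27816093 (DBpedia alone) to 0.29166666
;; (DBpedia + Freebase), a relative improvement of roughly +4.85%.
;; (percent-impact {:recall 0.27816093} {:recall 0.29166666})
;; => {:recall 4.855...}
```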
Let's switch Freebase for the other specialized public dataset, USPTO (organizations with trademarks in the US Patent and Trademark Office dataset).
(table (generate-stats :js :execute :datasets ["http://dbpedia.org/resource/" "http://www.uspto.gov"]))
True positives: 122, False positives: 59, True negatives: 19, False negatives: 311

key | value |
---|---|
:precision | 0.67403316 |
:recall | 0.2817552 |
:accuracy | 0.27592954 |
:f1 | 0.39739415 |
:f2 | 0.31887087 |
:f0.5 | 0.52722555 |
Adding the USPTO dataset to the DBpedia one instead of Freebase has the following effects on the different metrics:
metric | Impact in % |
---|---|
precision | -0.83% |
recall | +1.29% |
accuracy | +0.73% |
f1 | +0.65% |
f2 | +1.07% |
f0.5 | +0.03% |
As we may have expected, the gains are smaller than with Freebase. This may be partly because the USPTO dataset is smaller and more specialized than Freebase. Because it is more specialized (enterprises that have trademarks registered in the US), the gold standard may not represent well the organizations belonging to this dataset. In any case, there are still gains.
Let's continue and now include all three datasets.
(table (generate-stats :js :execute :datasets ["http://dbpedia.org/resource/" "http://www.uspto.gov" "http://rdf.freebase.com/ns/"]))
True positives: 127, False positives: 62, True negatives: 19, False negatives: 303

key | value |
---|---|
:precision | 0.6719577 |
:recall | 0.29534882 |
:accuracy | 0.2857143 |
:f1 | 0.41033927 |
:f2 | 0.3326349 |
:f0.5 | 0.53541315 |
Now let's see the impact of adding both Freebase and USPTO to the Wikipedia dataset:
metric | Impact in % |
---|---|
precision | -1.15% |
recall | +6.18% |
accuracy | +4.30% |
f1 | +3.95% |
f2 | +5.45% |
f0.5 | +1.51% |
This combination of public datasets is a key baseline for the conclusions below.
Now let's see the impact of adding the private datasets. We will continue to use the combination of the three public datasets (Wikipedia, Freebase and USPTO), to which we will add the private datasets (PD #1 and PD #2).
We will first add one of the private datasets (PD #1).
(table (generate-stats :js :execute :datasets ["http://dbpedia.org/resource/" "http://www.uspto.gov" "http://rdf.freebase.com/ns/" "http://kbpedia.org/datasets/private/1/"]))
True positives: 279, False positives: 102, True negatives: 19, False negatives: 111

key | value |
---|---|
:precision | 0.7322835 |
:recall | 0.7153846 |
:accuracy | 0.58317024 |
:f1 | 0.7237354 |
:f2 | 0.7187017 |
:f0.5 | 0.7288401 |
When we compare these results to the combination of the three public datasets alone, we get these percentage improvements:
metric | Impact in % |
---|---|
precision | +8.97% |
recall | +142.22% |
accuracy | +104.09% |
f1 | +76.38% |
f2 | +116.08% |
f0.5 | +36.12% |
If we instead compare this combined run against running private dataset #1 alone (not in combination with the public ones), we get these lesser, but still real, improvements:
metric | Impact in % |
---|---|
precision | +7.77% |
recall | +18.60% |
accuracy | +19.19% |
f1 | +13.25% |
f2 | +16.50% |
f0.5 | +9.99% |
So, while the highly targeted private dataset #1 performs better than the three combined public datasets, the combination of private dataset #1 and the three public ones shows still further improvements.
We can repeat this analysis, only now focusing on the second private dataset (PD #2). This first run combines the three public datasets with PD #2:
(table (generate-stats :js :execute :datasets ["http://dbpedia.org/resource/" "http://www.uspto.gov" "http://rdf.freebase.com/ns/" "http://kbpedia.org/datasets/private/2/"]))
True positives: 138, False positives: 69, True negatives: 19, False negatives: 285

key | value |
---|---|
:precision | 0.6666667 |
:recall | 0.32624114 |
:accuracy | 0.3072407 |
:f1 | 0.43809524 |
:f2 | 0.36334914 |
:f0.5 | 0.55155873 |
We can see that PD #2 in combination with the three public datasets does not perform as well as PD #1 added to the three public ones. This observation just affirms that not all of the private datasets have equivalent impact. Here are the percent differences for when PD #2 is added to the three public datasets vs. the three public datasets alone:
metric | Impact in % |
---|---|
precision | -0.78% |
recall | +10.46% |
accuracy | +7.52% |
f1 | +6.75% |
f2 | +9.23% |
f0.5 | +3.00% |
In this case, we actually see that precision drops when adding PD #2, though accuracy is still improved.
Now that we have seen the impact of PD #1 and PD #2 in isolation, let's see what happens when we combine all of the public and private datasets. First, let's look at the raw metrics of the run:
(table (generate-stats :js :execute :datasets ["http://dbpedia.org/resource/" "http://www.uspto.gov" "http://rdf.freebase.com/ns/" "http://kbpedia.org/datasets/private/1/" "http://kbpedia.org/datasets/private/2/"]))
True positives: 285, False positives: 102, True negatives: 19, False negatives: 105

key | value |
---|---|
:precision | 0.7364341 |
:recall | 0.7307692 |
:accuracy | 0.59491193 |
:f1 | 0.7335907 |
:f2 | 0.7318952 |
:f0.5 | 0.7352941 |
As before, let's look at the percentage changes due to adding both of the private datasets #1 and #2 to the three public datasets:
metric | Impact in % |
---|---|
precision | +9.60% |
recall | +147.44% |
accuracy | +108.22% |
f1 | +78.77% |
f2 | +120.02% |
f0.5 | +37.31% |
Note, for all metrics, this total combination of all datasets performs best compared to any of the other tested combinations.
There is one last feature of the publisher analyzer that we should highlight: the analyzer also allows us to identify unknown entities from the web page. (An "unknown entity" is identified as a likely organization entity, but one that does not already exist in the KB.) Sometimes, it is the unknown entity that is the publisher of the web page. The usefulness of the unknown entity identification is to flag new possible entities (organizations, in this case) that should be considered for addition to the overall knowledge base.
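A drastically simplified illustration of this idea (again, ours rather than the analyzer's actual logic): any organization candidate with no match in the knowledge base is flagged for review. The run that follows includes this unknown-entity detection on top of all public and private datasets.

```clojure
;; Simplified illustration of unknown-entity flagging (not the actual analyzer
;; logic): candidates that look like organizations but have no URI in the
;; knowledge base are flagged as possible additions.
(defn flag-unknown-orgs
  [candidate-names known-orgs]                 ; known-orgs: {name -> uri}
  (remove known-orgs candidate-names))

;; (flag-unknown-orgs ["Acme Corp" "Obscure Publisher Ltd"]
;;                    {"Acme Corp" "http://dbpedia.org/resource/Acme_Corp"})
;; => ("Obscure Publisher Ltd")
```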
(table (generate-stats :js :execute :datasets :all))
True positives: 345, False positives: 104, True negatives: 19, False negatives: 43

key | value |
---|---|
:precision | 0.76837415 |
:recall | 0.88917524 |
:accuracy | 0.7123288 |
:f1 | 0.82437277 |
:f2 | 0.86206895 |
:f0.5 | 0.78983516 |
As we can see, the overall accuracy improved by 19.73% when considering the unknown entities, compared to using the public and private datasets alone.
metric | Impact in % |
---|---|
precision | +4.33% |
recall | +21.67% |
accuracy | +19.73% |
f1 | +12.37% |
f2 | +17.79% |
f0.5 | +7.42% |
When we first tested the system with single datasets, some of them scored better than others on most of the metrics. However, does that mean we could use only those and be done with it? No. What this analysis tells us is that some datasets score better for this particular set of web pages: they cover more of the entities found in those pages. However, even if a dataset scores lower, that does not mean it is useless. In fact, the lower-scoring dataset may cover a prediction area not covered by a better one, which means that by combining the two we can improve the general prediction power of the system. This is what we see when adding the private datasets to the public ones.
Even though the highly curated and domain-specific private datasets score much better than the more general public datasets, the system still greatly benefits from the contribution of the public datasets, which significantly improve the overall accuracy of the system. We achieve a gain of 108% in accuracy by adding the private datasets to KBpedia's public ones. What this tells us is that KBpedia is a very useful structure and starting point for an entity tagging effort, but that adding domain data is probably essential to gain the overall accuracy desired for enterprise requirements.
Another thing this series of tests demonstrates is that the more knowledge we have, the more we can improve the accuracy of the system. Adding datasets does not appear to lower the overall performance of the system (though we did see one case of a slight decrease in precision for PD #2, even as all other metrics improved). Generally, the more data available for a given task, the better the results.
Finally, adding a feature to the system can also greatly improve its overall accuracy. In this case, we added the feature of detecting unknown entities (organization entities that do not exist in the datasets that compose the knowledge base), which improves the overall accuracy by another 20%. How is that possible? To understand this, we have to consider the domain: random web pages on the Web. A web page can be published by anybody and any organization, which means the long tail of web page publishers is probably quite long. Considering this, it is normal that existing knowledge bases do not contain all of the obscure organizations that publish web pages. This is most likely why a system that can detect and predict unknown entities as the publishers of web pages has a significant impact on the overall accuracy of the system. The flagging of such "unknown" entities tells us where to focus efforts to add to the known database of existing publishers.
As we saw in this analysis, adding high-quality and domain-specific private datasets can greatly improve the accuracy of such a prediction system. Some datasets may have a more significant impact than others, but overall, each dataset contributes to the overall improvement of the predictions.