USE CASE

Title: Benefits from Extending KBpedia with Private Datasets

Short Description: Significant improvements in tagging accuracy may be obtained by adding private (enterprise- or domain-specific) datasets to the standard public knowledge bases already in KBpedia.

Problem: We want to obtain as comprehensive and accurate tagging of entities as possible for our specific enterprise needs.

Approach: KBpedia provides a rich set of 30 million entities in its standard configuration. However, by identifying and including relevant entity lists already in the possession of the enterprise, or from specialty datasets in the relevant domain, significant improvements can be achieved across all of the standard metrics used for entity recognition and tagging. Further, our standard methodology includes the creation of reference, or "gold", standards for measuring the benefits of adding more data or performing other tweaks on the entity extraction algorithms.

Key Findings:
This use case demonstrates two aspects of working with the KBpedia knowledge structure. First, we demonstrate the benefits of adding private datasets to the standard knowledge bases included with KBpedia. And, second, we highlight our standard use of reference, or "gold", standards as a way of measuring progress in tweaking datasets and parameters when doing machine learning tasks.
The basis for this use case is an enterprise that is monitoring information published on the Web and wants to be able to identify the organization responsible for publishing a given page. The enterprise monitors the Web on a daily basis in its domain of interest and is able to identify new Web pages it has not seen before. Further, the enterprise also has two of its own datasets that contain possible candidate organizations that might be publishers of such pages. These two datasets are private and not available to the general public.
In this use case, we describe a publisher analyzer used for organization identification, the standard KBpedia datasets available for the task, the enterprise's own private datasets, and the approach we take to "gold standards" and the specifics of that standard for this case. Once these component parts are described, we proceed to give the results of adding or using different datasets. We then summarize some conclusions.
Note that this use case is broadly applicable to any entity recognition and tagging initiative.
The analysis framework is comprised of general platform code, the publisher analyzer, the standard KBpedia knowledge structure and its public knowledge bases, reference "gold standards", and, for the test, external (private enterprise) data.
The publisher analyzer attempts to determine the publisher of a web page by analyzing the web page's content. There are multiple moving parts to this analyzer, but its general internal workflow is as follows.
The machine learning system leverages existing algorithms to calculate the likelihood that an organization is the publisher of a web page and to detect unknown organizations. These are conventional uses of these algorithms. What differentiates the publisher analyzer is its knowledge base. We leverage KBpedia to detect known organization entities, we use the knowledge in KBpedia's combined KBs for each of these entities to improve the analysis process, and we constrain the analysis to certain types of named entities (determined by inference), among other steps. The special sauce of this entire process is the fully integrated set of knowledge bases that comprise KBpedia, including its hundreds of thousands of concepts, 39,000 reference concepts, and 20 million known entities.
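As a rough illustration of the matching step only (not the analyzer's actual machine learning code or KBpedia API), the sketch below scores candidates simply by how often a known organization name appears in the page text. The `detect-publisher` function, the toy `known-orgs` map, and the example URI are our own inventions for this sketch.

```clojure
;; Drastically simplified sketch of matching page content against known
;; organization entities. The real analyzer uses machine learning scoring and
;; KBpedia type inference; here a plain frequency count over a toy name->URI
;; map stands in for both.
(defn detect-publisher
  "Return the URI of the known organization mentioned most often in the
   page text, or :unknown when no known organization is found."
  [page-text known-orgs]                                   ; known-orgs: {name -> uri}
  (let [counts (into {}
                     (for [[org-name uri] known-orgs
                           :let [hits (count (re-seq (re-pattern (java.util.regex.Pattern/quote org-name))
                                                     page-text))]
                           :when (pos? hits)]
                       [uri hits]))]
    (if (seq counts)
      (key (apply max-key val counts))                     ; best-supported known entity
      :unknown)))                                          ; candidate for unknown-entity handling

;; (detect-publisher "© 2016 Acme Corp. All rights reserved."
;;                   {"Acme Corp" "http://dbpedia.org/resource/Acme_Corp"})
;; => "http://dbpedia.org/resource/Acme_Corp"
```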
These public datasets are then compared to two private datasets, which contain high-quality, curated, and domain-related listings of organizations. The numbers of organizations contained in these private datasets are much smaller than those in the public ones, but they are also more relevant to the domain. These private datasets are fairly typical of the specific information that an enterprise may have available in its own domain.
The reference standard, or "gold standard", employed in this use case is composed of 511 randomly selected Web pages that are manually vetted and characterized. (As a general rule of thumb we find about 500 examples in the positive training set to be adequate.)
The gold standard itself is quite simple. For each URL in the standard, we manually determine the publishing organization. Once the organization is determined, we search each dataset to see whether the entity already exists. If it does, we add the URI (unique identifier) of that entity in the knowledge base to the gold standard. It is this URI reference that is used to determine whether the publisher analyzer properly detects the actual publisher of the web page.
We also manually add a set of 10 web pages for which we are sure that no publisher can be determined. These are the 10 True Negative (see below) instances of the gold standard.
The gold standard also includes the identifier of possible unknown entities that are the publishers of the web pages. These are used to calculate the metrics when considering the unknown entities detected by the system.
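To make this concrete, a gold-standard record might be represented along the following lines. The field names are illustrative assumptions on our part, not the analyzer's actual file format; unknown-entity publishers would carry their own identifiers in the same way.

```clojure
;; Illustrative sketch of gold-standard records (field names are assumptions,
;; not the actual format used by the analyzer).
(def gold-standard
  [{:url           "http://example.com/known-publisher-page.html"
    :publisher-uri "http://dbpedia.org/resource/Example_Organization"}  ; known entity
   {:url           "http://example.com/no-publisher-page.html"
    :publisher-uri nil}])                                               ; a True Negative case
```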
The goal of the analysis is to determine how well the analyzer performs the task of detecting the organization that published a given web page. To do so, we use a set of metrics that helps us understand the performance of the system. The metrics calculation is based on the confusion matrix.
When we are processing a new run, all results are characterized according to four possible scoring values:
True Positive (TP): test identifies the same entity as in the gold standard
False Positive (FP): test identifies a different entity than what is in the gold standard
True Negative (TN): test identifies no entity; gold standard has no entity
False Negative (FN): test identifies no entity, but the gold standard has one
The True Positive, False Positive, True Negative, and False Negative counts (see Type I and type II errors for definitions) should be interpreted in the manner common to named entity recognition tasks.
This simple scoring method allows us to apply a series of metrics based on the four possible scoring values:
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
Accuracy = (TP + TN) / (TP + TN + FP + FN)
The F-score is a common combined score of the general prediction system: it combines precision and recall via their harmonic mean. More generally, Fβ = (1 + β²) × (Precision × Recall) / (β² × Precision + Recall), with β = 1, 2, and 0.5 for the three measures reported here. The f2 measure weighs recall higher than precision (by placing more emphasis on false negatives), and the f0.5 measure weighs recall lower than precision (by attenuating the influence of false negatives). Cognonto includes all three F-measures in its standard reports to give a general overview of what happens when we put an emphasis on precision or recall. Some clients prefer limiting false positives at the cost of lower recall; others want fuller coverage.
Still, in most instances, we have found that customers consider accuracy to be the most useful metric. We particularly emphasize that metric in the results below.
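As a cross-check on the numbers reported in the runs below, here is a minimal Clojure sketch of these metric calculations using the standard textbook formulas. The function is ours and is not the platform's `generate-stats` implementation.

```clojure
;; Minimal sketch of the confusion-matrix metrics used throughout this use case.
;; Standard textbook formulas only; not the platform's generate-stats internals.
(defn metrics
  "Compute the standard metrics from raw confusion-matrix counts."
  [tp fp tn fneg]
  (let [precision (/ tp (+ tp fp))
        recall    (/ tp (+ tp fneg))
        accuracy  (/ (+ tp tn) (+ tp tn fp fneg))
        f-beta    (fn [beta]
                    (let [b2 (* beta beta)]
                      (/ (* (inc b2) precision recall)
                         (+ (* b2 precision) recall))))]
    {:precision (double precision)
     :recall    (double recall)
     :accuracy  (double accuracy)
     :f1        (double (f-beta 1))
     :f2        (double (f-beta 2))
     :f0.5      (double (f-beta 0.5))}))

;; Example, using the counts of the DBpedia-only run reported below:
;; (metrics 121 57 19 314)
;; => {:precision 0.6797..., :recall 0.2781..., :accuracy 0.2739..., :f1 0.3947..., ...}
```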
The goal with these tests is to run the gold standard calculation procedure against various combinations of the available datasets in order to determine their comparative contribution to improved accuracy (or other metric of choice). Here is the general run procedure; note that each run has a standard presentation of the run statistics, beginning with the four scoring values, followed by the standard metrics:
The first step is to create the starting basis, which includes no dataset. We then add different datasets, and try different combinations, computing against the gold standard each time so that we know the impact of each on the metrics.
(table (generate-stats :js :execute :datasets []))
True positives: 2, False positives: 5, True negatives: 19, False negatives: 485

key | value |
---|---|
:precision | 0.2857143 |
:recall | 0.0041067763 |
:accuracy | 0.04109589 |
:f1 | 0.008097166 |
:f2 | 0.0051150895 |
:f0.5 | 0.019417476 |
Now, let's see the impact of each of the datasets that exist in the knowledge base we created to perform these tests. This will give us an indication of the inherent impact of each dataset on the prediction task.
Let's test the impact of adding a single general-purpose dataset, the publicly available Wikipedia (via DBpedia):
(table (generate-stats :js :execute :datasets ["http://dbpedia.org/resource/"]))
True positives: 121, False positives: 57, True negatives: 19, False negatives: 314

key | value |
---|---|
:precision | 0.6797753 |
:recall | 0.27816093 |
:accuracy | 0.2739726 |
:f1 | 0.39477977 |
:f2 | 0.31543276 |
:f0.5 | 0.52746296 |
Now, let's test the impact of adding another single general-purpose dataset, this one the publicly available Freebase:
(table (generate-stats :js :execute :datasets ["http://rdf.freebase.com/ns/"]))
True positives: 11, False positives: 14, True negatives: 19, False negatives: 467

key | value |
---|---|
:precision | 0.44 |
:recall | 0.023012552 |
:accuracy | 0.058708414 |
:f1 | 0.043737575 |
:f2 | 0.028394425 |
:f0.5 | 0.09515571 |
Now, let's test the impact of adding a different, more specialized publicly available dataset, the USPTO:
(table (generate-stats :js :execute :datasets ["http://www.uspto.gov"]))
True positives: 6, False positives: 13, True negatives: 19, False negatives: 473

key | value |
---|---|
:precision | 0.31578946 |
:recall | 0.012526096 |
:accuracy | 0.04892368 |
:f1 | 0.024096385 |
:f2 | 0.015503876 |
:f0.5 | 0.054054055 |
Now, let's test the first private dataset:
(table (generate-stats :js :execute :datasets ["http://kbpedia.org/datasets/private/1/"]))
True positives: 231, False positives: 109, True negatives: 19, False negatives: 152

key | value |
---|---|
:precision | 0.67941177 |
:recall | 0.60313314 |
:accuracy | 0.4892368 |
:f1 | 0.6390042 |
:f2 | 0.61698717 |
:f0.5 | 0.6626506 |
And, then, the second private dataset:
(table (generate-stats :js :execute :datasets ["http://kbpedia.org/datasets/private/2/"]))
True positives: 24, False positives: 21, True negatives: 19, False negatives: 447

key | value |
---|---|
:precision | 0.53333336 |
:recall | 0.050955415 |
:accuracy | 0.08414873 |
:f1 | 0.093023255 |
:f2 | 0.0622084 |
:f0.5 | 0.1843318 |
A more realistic analysis is to use a combination of datasets. Let's see what happens to the performance metrics if we start combining the public datasets only.
First, let's start by combining Wikipedia and Freebase.
(table (generate-stats :js :execute :datasets ["http://dbpedia.org/resource/" "http://rdf.freebase.com/ns/"]))
True positives: 126, False positives: 60, True negatives: 19, False negatives: 306

key | value |
---|---|
:precision | 0.67741936 |
:recall | 0.29166666 |
:accuracy | 0.28375733 |
:f1 | 0.407767 |
:f2 | 0.3291536 |
:f0.5 | 0.53571427 |
Adding the Freebase dataset to the DBpedia one has the following effects on the different metrics:
metric | Impact in % |
---|---|
precision | -0.35% |
recall | +4.85% |
accuracy | +3.57% |
f1 | +3.29% |
f2 | +4.34% |
f0.5 | +1.57% |
As we can see, the impact of adding Freebase to the knowledge base is positive, even if not groundbreaking considering the size of the dataset.
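For reference, the "Impact in %" figures in these tables are relative changes between two runs. A small helper along the following lines (ours, not part of the platform) reproduces them from the metric maps:

```clojure
;; Relative change between two runs' metric maps, expressed in percent.
;; This mirrors how the "Impact in %" tables in this use case are derived.
(defn percent-impact
  [baseline-metrics new-metrics]
  (into {}
        (for [[k base] baseline-metrics]
          [k (* 100.0 (/ (- (new-metrics k) base) base))])))

;; Example: recall goes from 0.27816093 (DBpedia alone) to 0.29166666
;; (DBpedia + Freebase), a relative improvement of roughly +4.85%.
;; (percent-impact {:recall 0.27816093} {:recall 0.29166666})
;; => {:recall 4.855...}
```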
Let's switch Freebase for the other specialized public dataset, USPTO (organizations with trademarks in the US Patent and Trademark Office dataset).
(table (generate-stats :js :execute :datasets ["http://dbpedia.org/resource/" "http://www.uspto.gov"]))
True positives: 122, False positives: 59, True negatives: 19, False negatives: 311

key | value |
---|---|
:precision | 0.67403316 |
:recall | 0.2817552 |
:accuracy | 0.27592954 |
:f1 | 0.39739415 |
:f2 | 0.31887087 |
:f0.5 | 0.52722555 |
Adding the USPTO dataset to the DBpedia one instead of Freebase has the following effects on the different metrics:
metric | Impact in % |
---|---|
precision | -0.83% |
recall | +1.29% |
accuracy | +0.73% |
f1 | +0.65% |
f2 | +1.07% |
f0.5 | +0.03% |
As we may have expected, the gains are smaller than with Freebase. This may be partly because the USPTO dataset is smaller and more specialized than Freebase. Because it is more specialized (enterprises that have trademarks registered in the US), the gold standard may not represent well the organizations belonging to this dataset. In any case, there are still gains.
Let's continue and now include all three datasets.
(table (generate-stats :js :execute :datasets ["http://dbpedia.org/resource/" "http://www.uspto.gov" "http://rdf.freebase.com/ns/"]))
True positives: 127, False positives: 62, True negatives: 19, False negatives: 303

key | value |
---|---|
:precision | 0.6719577 |
:recall | 0.29534882 |
:accuracy | 0.2857143 |
:f1 | 0.41033927 |
:f2 | 0.3326349 |
:f0.5 | 0.53541315 |
Now let's see the impact of adding both Freebase and USPTO to the Wikipedia dataset:
metric | Impact in % |
---|---|
precision | -1.15% |
recall | +6.18% |
accuracy | +4.30% |
f1 | +3.95% |
f2 | +5.45% |
f0.5 | +1.51% |
This combination of public datasets is a key baseline for the conclusions below.
Now let's see the impact of adding the private datasets. We will continue to use the combination of the three public datasets (Wikipedia, Freebase and USPTO), to which we will add the private datasets (PD #1 and PD #2).
We will first add one of the private datasets (PD #1).
(table (generate-stats :js :execute :datasets ["http://dbpedia.org/resource/" "http://www.uspto.gov" "http://rdf.freebase.com/ns/" "http://kbpedia.org/datasets/private/1/"]))
True positives: 279, False positives: 102, True negatives: 19, False negatives: 111

key | value |
---|---|
:precision | 0.7322835 |
:recall | 0.7153846 |
:accuracy | 0.58317024 |
:f1 | 0.7237354 |
:f2 | 0.7187017 |
:f0.5 | 0.7288401 |
When we compare these results to the combination of the three public datasets alone, we get these percentage improvements:
metric | Impact in % |
---|---|
precision | +8.97% |
recall | +142.22% |
accuracy | +104.09% |
f1 | +76.38% |
f2 | +116.08% |
f0.5 | +36.12% |
If we instead compare this combined run against running private dataset #1 alone (not in combination with the public ones), we get these lesser, but still real, improvements:
metric | Impact in % |
---|---|
precision | +7.77% |
recall | +18.60% |
accuracy | +19.19% |
f1 | +13.25% |
f2 | +16.50% |
f0.5 | +9.99% |
So, while the highly targeted private dataset #1 performs better than the three combined public datasets, the combination of private dataset #1 and the three public ones shows still further improvements.
We can repeat this analysis, only now focusing on the second private dataset (PD #2). This first run combines the three public datasets with PD #2:
(table (generate-stats :js :execute :datasets ["http://dbpedia.org/resource/" "http://www.uspto.gov" "http://rdf.freebase.com/ns/" "http://kbpedia.org/datasets/private/2/"]))
True positives: 138, False positives: 69, True negatives: 19, False negatives: 285

key | value |
---|---|
:precision | 0.6666667 |
:recall | 0.32624114 |
:accuracy | 0.3072407 |
:f1 | 0.43809524 |
:f2 | 0.36334914 |
:f0.5 | 0.55155873 |
We can see that PD #2 in combination with the three public datasets does not perform as well as PD #1 added to the three public ones. This observation just affirms that not all of the private datasets have equivalent impact. Here are the percent differences for when PD #2 is added to the three public datasets vs. the three public datasets alone:
metric | Impact in % |
---|---|
precision | -0.78% |
recall | +10.46% |
accuracy | +7.52% |
f1 | +6.75% |
f2 | +9.23% |
f0.5 | +3.00% |
In this case, we actually see that precision drops when adding PD #2, though accuracy is still improved.
Now that we have seen the impact of PD #1 and PD #2 in isolation, let's see what happens when we combine all of the public and private datasets. First, let's look at the raw metrics of the run:
(table (generate-stats :js :execute :datasets ["http://dbpedia.org/resource/" "http://www.uspto.gov" "http://rdf.freebase.com/ns/" "http://kbpedia.org/datasets/private/1/" "http://kbpedia.org/datasets/private/2/"]))
True positives: 285, False positives: 102, True negatives: 19, False negatives: 105

key | value |
---|---|
:precision | 0.7364341 |
:recall | 0.7307692 |
:accuracy | 0.59491193 |
:f1 | 0.7335907 |
:f2 | 0.7318952 |
:f0.5 | 0.7352941 |
As before, let's look at the percentage changes due to adding both of the private datasets #1 and #2 to the three public datasets:
metric | Impact in % |
---|---|
precision | +9.60% |
recall | +147.44% |
accuracy | +108.22% |
f1 | +78.77% |
f2 | +120.02% |
f0.5 | +37.31% |
Note, for all metrics, this total combination of all datasets performs best compared to any of the other tested combinations.
There is one last feature of the publisher analyzer that we should highlight: the analyzer also allows us to identify unknown entities from the web page. (An "unknown entity" is identified as a likely organization entity, but one that does not already exist in the KB.) Sometimes, it is the unknown entity that is the publisher of the web page. The usefulness of the unknown entity identification is to flag new possible entities (organizations, in this case) that should be considered for addition to the overall knowledge base.
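A drastically simplified illustration of this idea (again, ours rather than the analyzer's actual logic): any organization candidate with no match in the knowledge base is flagged for review. The run that follows includes this unknown-entity detection on top of all public and private datasets.

```clojure
;; Simplified illustration of unknown-entity flagging (not the actual analyzer
;; logic): candidates that look like organizations but have no URI in the
;; knowledge base are flagged as possible additions.
(defn flag-unknown-orgs
  [candidate-names known-orgs]                 ; known-orgs: {name -> uri}
  (remove known-orgs candidate-names))

;; (flag-unknown-orgs ["Acme Corp" "Obscure Publisher Ltd"]
;;                    {"Acme Corp" "http://dbpedia.org/resource/Acme_Corp"})
;; => ("Obscure Publisher Ltd")
```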
(table (generate-stats :js :execute :datasets :all))
True positives: 345, False positives: 104, True negatives: 19, False negatives: 43

key | value |
---|---|
:precision | 0.76837415 |
:recall | 0.88917524 |
:accuracy | 0.7123288 |
:f1 | 0.82437277 |
:f2 | 0.86206895 |
:f0.5 | 0.78983516 |
As we can see, the overall accuracy improved by 19.73% when considering the unknown entities, compared to using the public and private datasets alone.
metric | Impact in % |
---|---|
precision | +4.33% |
recall | +21.67% |
accuracy | +19.73% |
f1 | +12.37% |
f2 | +17.79% |
f0.5 | +7.42% |
When we first tested the system with single datasets, some of them scored better than others on most of the metrics. However, does that mean we could use only those and be done with it? No. What this analysis tells us is that some datasets score better for this particular set of web pages: they cover more of the entities found in those pages. However, even if a dataset scores lower, that does not mean it is useless. In fact, the lower-scoring dataset may cover a prediction area not covered by a better one, which means that by combining the two we can improve the general prediction power of the system. This is what we see when adding the private datasets to the public ones.
Even though the highly curated and domain-specific private datasets score much better than the more general public datasets, the system still greatly benefits from the contribution of the public datasets, which significantly improve the overall accuracy of the system. We achieve a gain of 108% in accuracy by adding the private datasets to KBpedia's public ones. What this tells us is that KBpedia is a very useful structure and starting point for an entity tagging effort, but that adding domain data is probably essential to gain the overall accuracy desired for enterprise requirements.
Another thing this series of tests demonstrates is that the more knowledge we have, the more we can improve the accuracy of the system. Adding datasets does not appear to lower the overall performance of the system (though we did see one case of a slight decrease in precision for PD #2, even as all other metrics improved). Generally, the more data available for a given task, the better the results.
Finally, adding a feature to the system can also greatly improve its overall accuracy. In this case, we added the feature of detecting unknown entities (organization entities that do not exist in the datasets that compose the knowledge base), which improves the overall accuracy by another 20%. How is that possible? To understand this, we have to consider the domain: random web pages on the Web. A web page can be published by anybody and any organization, which means the long tail of web page publishers is probably quite long. Considering this, it is normal that existing knowledge bases do not contain all of the obscure organizations that publish web pages. This is most likely why a system that can detect and predict unknown entities as the publishers of web pages has a significant impact on the overall accuracy of the system. The flagging of such "unknown" entities tells us where to focus efforts to add to the known database of existing publishers.
As we saw in this analysis, adding high-quality and domain-specific private datasets can greatly improve the accuracy of such a prediction system. Some datasets may have a more significant impact than others, but overall, each dataset contributes to the overall improvement of the predictions.