URL matching via Origin hash
Locating a specific review from a specific product amongst all the data can sometimes feel like searching for a needle in a haystack. Recently, Socialgist began offering a new piece of metadata which helps solve this challenge.
The Problem
The reviews for a single product can be reached through several distinct URLs that all point to the same content. For example, the following URLs could all be used to track a single product.
https://www.amazon.com/Panasonic-Electric-ES-LV95-S-Flexible-Automatic/dp/B00FPQ70Z2
https://www.amazon.com/dp/B00FPQ70Z2
https://www.amazon.com/Panasonic-Electric-ES-LV95-S-Flexible-Automatic/product-reviews/B00FPQ70Z2/
Searching by URL is a common technique for locating review data, which raises the question of which URL is best to search on for reviews of this product. This can be a challenging question to answer because our crawlers are configured to handle hundreds of sites, each with its own rules and criteria. It can be difficult to communicate how to properly locate data across such a wide variety of sources.
Origin Hash
For targeted crawling done by Socialgist, our team imports lists of URLs. The import process is designed to sanitize and transform these URLs into the formats expected by our crawlers. This can often result in the crawled and delivered data having a different URL than the one originally submitted. An end user looking for an exact match on their original URL will then fail to locate the review data.
The Socialgist team noted this was a common challenge for our users and sought to provide additional metadata to link URL submissions to the final delivered data. This led to the introduction of the origin field.
Origin is a hashed value of the original URL submission. The intended use case is for the Socialgist user to take an MD5 hash of their URL before submitting it to Socialgist (SG) for addition. The Socialgist team also takes an MD5 hash of this URL before applying any transformation to it. After the URL has been transformed into the proper format for crawling, the origin hash value remains attached to it.
Usage
Let's take a use case where the URL to be tracked is as follows.
https://www.amazon.com/Panasonic-Electric-ES-LV95-S-Flexible-Automatic/product-reviews/B00FPQ70Z2/
In this example, the SG user would want to set up an internal process that creates a hash of each URL they submit to Socialgist and stores the hashed value alongside the URL. For the sake of the example, we can walk through this manually.
Using an online hashing tool such as https://www.md5hashgenerator.com/, we can take a hash of this URL, which gives c64d84310ac9d129efa6f5ad0faab72c. We have now established the relationship between the URL and the origin hash.
When the SG support team runs this URL through our import process, the importer also generates the same MD5 hash and connects that hash to the URL. On the SG side, the URL may undergo additional transformation to make it more suitable for crawling, but the hash remains unchanged and still matches the one the SG user generated.
Once the requested URL is crawled, review data is generated. This review data is delivered along with the origin hash, which is the value calculated prior to submission to Socialgist. The hash can be used to pick out exactly the reviews you are looking for from all the other data. This method also removes the need to know which specific URL to search for in order to locate a piece of review data.
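The same hash can be produced in code instead of with an online tool. The following is a minimal sketch using Python's standard hashlib module. The function name origin_hash is hypothetical, and the sketch assumes the URL is hashed as UTF-8 text exactly as submitted, with no trimming or normalization.

```python
import hashlib


def origin_hash(url: str) -> str:
    """Return the MD5 hex digest of a URL exactly as it will be submitted."""
    # Assumption: the URL is hashed as UTF-8 bytes with no normalization.
    # Any difference (trailing slash, stray whitespace, letter case)
    # produces a completely different hash.
    return hashlib.md5(url.encode("utf-8")).hexdigest()


submitted_url = (
    "https://www.amazon.com/Panasonic-Electric-ES-LV95-S-Flexible-Automatic"
    "/product-reviews/B00FPQ70Z2/"
)

# Store this value next to the URL before sending the URL to Socialgist.
print(origin_hash(submitted_url))
```

Because the hash is sensitive to every character, it is worth hashing the exact string that is sent in the submission rather than a copy that has been cleaned up or reformatted.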
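Filtering on the origin hash might look something like the sketch below. It assumes each delivered record has already been parsed into a Python dict and that the hash sits at review.reference.origin, as in the stream example in this article. The function name reviews_for_origin and the sample records are hypothetical and exist only for illustration.

```python
from typing import Any, Dict, Iterable, Iterator


def reviews_for_origin(
    records: Iterable[Dict[str, Any]], origin: str
) -> Iterator[Dict[str, Any]]:
    """Yield only the records whose origin hash equals the given value."""
    for record in records:
        # Assumed layout: {"review": {"reference": {"origin": "..."}}}
        reference = record.get("review", {}).get("reference", {})
        if reference.get("origin") == origin:
            yield record


# Hypothetical, heavily trimmed records standing in for real feed data.
sample_records = [
    {"review": {"reference": {"origin": "c64d84310ac9d129efa6f5ad0faab72c"}}},
    {"review": {"reference": {"origin": "00000000000000000000000000000000"}}},
    {"review": {}},  # a record without a reference block is skipped
]

target = "c64d84310ac9d129efa6f5ad0faab72c"
matches = list(reviews_for_origin(sample_records, target))
print(len(matches))
```

Note that the filter never looks at the url field, so it keeps working even when the delivered URL differs from the submitted one.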
Stream Example
To help visualize the data, let's also look at an example record that might come through in your feed data.
This record is partially stripped down for brevity.
{
  "review": {
    "reference": {
      "requestId": "1999",
      "projectId": "",
      "origin": "c64d84310ac9d129efa6f5ad0faab72c",
      "reviewProduct": "premium"
    },
    ...
    "url": "https://www.amazon.com/Panasonic-Electric-ES-LV95-S-Flexible-Automatic/product-reviews/B00FPQ70Z2/?sortBy=bySubmissionDateDescending&reviewerType=all_reviews&formatType=current_format",
    ...
    "item": "Panasonic Arc5 Electric Razor for Men, 5 Blades Shaver and Trimmer - Sensor Technology, Automatic Clean",
    "itemURL": "https://www.amazon.com/Panasonic-Electric-ES-LV95-S-Flexible-Automatic/dp/B00FPQ70Z2/"
  }
}
You may notice that the url field is close to, but does not match, the originally requested URL. The itemURL field also does not match the originally submitted URL. The origin hash, however, still lets us map back to the original submission.
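Mapping a delivered record back to its original submission can be sketched as a simple lookup table keyed by hash. The names submitted_urls, origin_to_url, and original_submission are hypothetical, and the record is a trimmed stand-in whose origin value is computed in the code rather than copied from a real feed.

```python
import hashlib
from typing import Any, Dict, Optional

# URLs exactly as they were submitted to Socialgist (illustrative list).
submitted_urls = [
    "https://www.amazon.com/Panasonic-Electric-ES-LV95-S-Flexible-Automatic"
    "/product-reviews/B00FPQ70Z2/",
]

# Lookup table: MD5 hash of each submitted URL -> the URL itself.
origin_to_url = {
    hashlib.md5(url.encode("utf-8")).hexdigest(): url for url in submitted_urls
}


def original_submission(
    record: Dict[str, Any], lookup: Dict[str, str]
) -> Optional[str]:
    """Return the submitted URL for a record, or None if the hash is unknown."""
    origin = record.get("review", {}).get("reference", {}).get("origin")
    return lookup.get(origin)


# Trimmed stand-in for a delivered record. The url field carries extra query
# parameters, so an exact URL comparison against the submission would fail.
record = {
    "review": {
        "reference": {
            "origin": hashlib.md5(submitted_urls[0].encode("utf-8")).hexdigest()
        },
        "url": submitted_urls[0] + "?sortBy=bySubmissionDateDescending",
    }
}

print(record["review"]["url"] == submitted_urls[0])  # exact URL match fails
print(original_submission(record, origin_to_url))    # hash lookup succeeds
```

A dictionary keeps the lookup constant-time per record, which may matter when the feed contains many records for many submitted URLs.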