Search Using Structured Data
Structured data is information formatted into a repository that a search engine can read easily. Examples include the XML markup in XML sitemaps and the schema vocabulary found in JSON-LD scripts. It is distinct from semi-structured and unstructured data, which have less formatting.
A search engine that answers questions by crawling and indexing facts found in structured data on a site works differently from one that matches strings. A string-matching engine looks at the words used in a query and tries to return unstructured documents that contain the same words, hoping that one of them holds an actual answer to the informational need that inspired the query. Search using structured data takes another route, as seen in this flowchart from a 2017 Google patent:
This newer patent tells us that it might solve that book search in this manner:
In particular, for each encoded data item associated with a given identified schema, the system searches the locations in the encoded data item identified by the schema as storing values for the specified keys to identify encoded data items that store values for the specified keys that satisfy the requirements specified in the query. For example, if the query is for semi-structured data items that have a value “Ernest Hemingway” for an “author” key and that have values in a range of “1948-1952” for a “year published” key, the system can identify encoded data items that store a value corresponding to “Ernest Hemingway” in the location identified in the schema associated with the encoded data item as storing the value for the “author” key and that store a value in the range from “1948-1952” in the location identified in the schema associated with the encoded data item as storing the value for the “year published” key. Thus, the system can identify encoded data items that satisfy the query efficiently, i.e., without searching encoded data items that do not include values for each key specified in the received query and without searching locations in the encoded data items that are not identified as storing values for the specified keys.
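The lookup described in that passage can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each schema is modeled as a mapping from a key to a tuple position, and the names `SCHEMAS`, `ITEMS`, and `search` are hypothetical rather than taken from the patent. The book records are sample data chosen to match the patent's Hemingway example.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical schemas: each maps a key to the position where its value is stored.
SCHEMAS: Dict[str, Dict[str, int]] = {
    "book_v1": {"title": 0, "author": 1, "year published": 2},
    "book_v2": {"author": 0, "title": 1},  # stores no "year published" value
}

# Encoded data items, grouped by the schema they are associated with.
ITEMS: Dict[str, List[Tuple]] = {
    "book_v1": [
        ("The Old Man and the Sea", "Ernest Hemingway", 1952),
        ("Across the River and Into the Trees", "Ernest Hemingway", 1950),
        ("For Whom the Bell Tolls", "Ernest Hemingway", 1940),
    ],
    "book_v2": [("Ernest Hemingway", "A Moveable Feast")],
}

Requirement = Callable[[object], bool]


def search(requirements: Dict[str, Requirement]) -> List[Tuple]:
    """Return encoded items whose values satisfy every requirement in the query."""
    matches = []
    for schema_id, locations in SCHEMAS.items():
        # Skip schemas that do not identify a location for every queried key.
        if not all(key in locations for key in requirements):
            continue
        for item in ITEMS[schema_id]:
            # Read only the locations the schema identifies for the queried keys.
            if all(test(item[locations[key]]) for key, test in requirements.items()):
                matches.append(item)
    return matches


results = search({
    "author": lambda value: value == "Ernest Hemingway",
    "year published": lambda value: 1948 <= value <= 1952,
})
```

Items stored under `book_v2` are never inspected, because that schema has no location for the "year published" key. That skipped work is the efficiency the patent passage points to.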
Structured Data and JSON-LD
It was interesting to see Google come out with a patent about searching semi-structured data that focuses on the use of JSON-LD. We see them providing an example of JSON on one of the Google Developers pages at
As it tells us on that page:
This documentation describes which fields are required, recommended, or optional for structured data with special meaning to Google Search. Most Search structured data uses schema.org vocabulary, but you should rely on the documentation on developers.google.com as definitive for Google Search behavior, rather than the schema.org documentation. Attributes or objects not described here are not required by Google Search, even if marked as required by schema.org.
The page then points us to the
Machine Readable Extraction of Facts
I’ve used the analogy that XML sitemaps are machine-readable in a way that HTML sitemaps are not; in the same way, JSON-LD presents the facts on a site in machine-readable form, as opposed to content in HTML format. As the patent tells us, that is its purpose:
In general, this specification describes techniques for extracting facts from collections of documents.
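To make the machine-readable point concrete, here is a small sketch that uses only Python's standard library to pull JSON-LD key/value pairs out of a page without interpreting the visible HTML. The sample page and the class name `JsonLdExtractor` are made up for illustration, and nothing here is meant to represent how Google's own systems extract facts.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects the parsed contents of <script type="application/ld+json"> blocks."""

    def __init__(self) -> None:
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script text may arrive in several chunks, so buffer it.
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.items.append(json.loads("".join(self._buffer)))
            self._in_jsonld = False


# Illustrative page: the same facts appear as JSON-LD and as prose.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Book",
 "name": "The Old Man and the Sea",
 "author": {"@type": "Person", "name": "Ernest Hemingway"},
 "datePublished": "1952"}
</script>
</head><body><p>A novella by Ernest Hemingway, published in 1952.</p></body></html>
"""

parser = JsonLdExtractor()
parser.feed(SAMPLE_HTML)
parser.close()
book = parser.items[0]
author_name = book["author"]["name"]
```

The author and publication date come back as labeled values, with no need to guess which words in the paragraph text refer to a person or a year.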
The patent discusses schemas that might be on a site, and key/value pairs that could be searched, and details about such a search of semi-structured data on a site:
The aspect further includes receiving a query for semi-structured data items, wherein the query specifies requirements for values for one or more keys; identifying schemas from the plurality of schemas that identify locations for values corresponding to each of the one or more keys; for each identified schema, searching the encoded data items associated with the schema to identify encoded data items that satisfy the query; and providing data identifying values from the encoded data items that satisfy the query in response to the query. Searching the encoded data items associated with the schema includes: searching, for each encoded data item associated with the schema, the locations in the encoded data item identified by the schema as storing values for the specified keys to identify whether the encoded data item stores values for the specified keys that satisfy the requirements specified in the query.
The patent providing details of the use of JSON-LD to provide a machine-readable set of facts on a site can be found here:
Inventors: Martin Probst
Assignee: Google Inc.
US Patent: 9,754,048
Granted: September 5, 2017
Filed: October 6, 2014
Methods, systems, and apparatus, including computer programs encoded on computer storage media, for storing semi-structured data. One of the methods includes maintaining a plurality of schemas; receiving a first semi-structured data item; determining that the first semi-structured data item does not match any of the schemas in the plurality of schemas; and in response to determining that the first semi-structured data item does not match any of the schemas in the plurality of schemas: generating a new schema, encoding the first semi-structured data item in the first data format to generate the first new encoded data item in accordance with the new schema, storing the first new encoded data item in the data item repository, and associating the first new encoded data item with the new schema.
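The storage steps in that abstract can be sketched roughly as follows. This is a simplified reading, not the patent's implementation: it assumes that an item "matches" a schema when it has exactly the same set of keys, and that encoding means storing values in the schema's key order. The class `SchemaStore` and its attributes are hypothetical names.

```python
from typing import Any, Dict, List, Optional, Tuple


class SchemaStore:
    """Toy repository that keeps one schema per distinct set of keys."""

    def __init__(self) -> None:
        # A schema is an ordered tuple of keys; its id is its index in this list.
        self.schemas: List[Tuple[str, ...]] = []
        # Each stored entry is (schema id, encoded values in schema key order).
        self.repository: List[Tuple[int, Tuple[Any, ...]]] = []

    def _match(self, item: Dict[str, Any]) -> Optional[int]:
        # Assumption: matching means the item has exactly the schema's keys.
        keys = tuple(sorted(item))
        for schema_id, schema in enumerate(self.schemas):
            if schema == keys:
                return schema_id
        return None

    def store(self, item: Dict[str, Any]) -> int:
        schema_id = self._match(item)
        if schema_id is None:
            # No existing schema matches, so generate a new one.
            self.schemas.append(tuple(sorted(item)))
            schema_id = len(self.schemas) - 1
        schema = self.schemas[schema_id]
        # Encode the item according to the schema and associate it with that schema.
        encoded = tuple(item[key] for key in schema)
        self.repository.append((schema_id, encoded))
        return schema_id


store = SchemaStore()
first = store.store({"author": "Ernest Hemingway",
                     "title": "The Old Man and the Sea",
                     "year published": 1952})
second = store.store({"title": "Across the River and Into the Trees",
                      "author": "Ernest Hemingway",
                      "year published": 1950})
third = store.store({"author": "Ernest Hemingway", "title": "A Moveable Feast"})
```

The first two items share a schema even though their keys arrive in a different order, while the third item, which lacks a "year published" key, triggers the creation of a second schema.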
Take Aways on Structured Data Use
By using structured data, such as Schema vocabulary in JSON-LD format, you provide precise facts in key/value pairs as an alternative to the HTML-based content on the pages of a site. Make sure that you follow the
And if you are optimizing a site for Google, it also helps to optimize the same site for Bing, and it is good to see that Bing seems to like JSON-LD too. It has taken a while for Bing to do that (see Aaron Bradley’s post,
Copyright © 2019