Natural Language Processing


We provide a natural language processing (NLP) API that extracts structured clinical information from unstructured text and image files, such as JPG, PNG, and PDF files. It is part of our RESTful API and is based on the FHIR STU3 (Release 3) standard.

Resources

The workflow of our NLP API consists of four stages:

  • Add the document to the patient’s chart. See the “Health Gorilla FHIR RESTful API specification” for more details.
  • Create a Subscription to be notified when the task completes and to receive the results.
  • Start an NLP job for the given document.
  • Receive the results.

Create Subscription

First, add a subscription to receive operation results. The subscription must be a webhook: Health Gorilla sends an HTTP POST request to the specified URL when the operation is complete. The payload contains the result of the operation performed on the given document.


The required attributes are listed below.

| Attribute | Value |
|---|---|
| criteria | DocumentReference.OCR |
| channel.endpoint | URL of your webhook; the endpoint must support TLS 1.2 or later |
| channel.payload | application/hg-ocr+json |
| channel.header | HTTP header that is used to authorise the request |

Example:

{
    "resourceType": "Subscription",
    "status": "requested",
    "end": "2022-01-01T00:00:00Z",
    "reason": "NLP",
    "criteria": "DocumentReference.OCR",
    "channel": {
        "type": "rest-hook",
        "endpoint": "https://your-site.com/hg/webhook",
        "payload": "application/hg-ocr+json",
        "header": [
            "Authorization: Bearer your-secret-token"
        ]
    }
}
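
As a rough illustration, the Subscription above could be built and submitted from Python as sketched below. This is not an official client. It assumes the standard FHIR create interaction (a POST to the Subscription endpoint under the https://api.healthgorilla.com/fhir base used elsewhere on this page), a Bearer access token for the API call, and a JSON content type. The names build_subscription and build_request are hypothetical, so check the FHIR RESTful API specification for the exact details.

```python
import json
import urllib.request

FHIR_BASE = "https://api.healthgorilla.com/fhir"  # base URL taken from the $ocr examples


def build_subscription(webhook_url, webhook_token, end):
    """Return a Subscription resource carrying the attributes required for NLP results."""
    return {
        "resourceType": "Subscription",
        "status": "requested",
        "end": end,
        "reason": "NLP",
        "criteria": "DocumentReference.OCR",
        "channel": {
            "type": "rest-hook",
            "endpoint": webhook_url,
            "payload": "application/hg-ocr+json",
            # Header used to authorise the request sent to the webhook.
            "header": [f"Authorization: Bearer {webhook_token}"],
        },
    }


def build_request(subscription, access_token):
    """Prepare (but do not send) the HTTP request that creates the Subscription."""
    return urllib.request.Request(
        url=f"{FHIR_BASE}/Subscription",  # assumption: standard FHIR create endpoint
        data=json.dumps(subscription).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {access_token}",  # assumption: Bearer API auth
            "Content-Type": "application/json",  # assumption
        },
    )


subscription = build_subscription(
    "https://your-site.com/hg/webhook", "your-secret-token", "2022-01-01T00:00:00Z"
)
request = build_request(subscription, "YOUR_ACCESS_TOKEN")
print(request.get_method(), request.full_url)
```

Sending is left to the caller, for example with urllib.request.urlopen(request), so that the payload can be inspected first.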

Start NLP job


Health Gorilla defines a custom OCR operation applicable to the DocumentReference resource.

To extract text from the given document, send an HTTP GET request to:

https://api.healthgorilla.com/fhir/DocumentReference/DOCUMENT_ID/$ocr

To extract both text and medical information from the given document, send an HTTP GET request to:

https://api.healthgorilla.com/fhir/DocumentReference/DOCUMENT_ID/$ocr?cm=true
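
The two URL forms above differ only in the cm query parameter, so a small helper can build either one. The sketch below is illustrative: ocr_url and its extract_medical_info flag are hypothetical names, and authentication is out of scope here.

```python
from urllib.parse import quote, urlencode

FHIR_BASE = "https://api.healthgorilla.com/fhir"


def ocr_url(document_id, extract_medical_info=False):
    """Build the $ocr operation URL for a DocumentReference."""
    # Percent-encode the ID so unexpected characters cannot alter the path.
    url = f"{FHIR_BASE}/DocumentReference/{quote(document_id, safe='')}/$ocr"
    if extract_medical_info:
        # cm=true additionally requests extraction of medical information.
        url += "?" + urlencode({"cm": "true"})
    return url


print(ocr_url("7e00155db23e480d67609b59"))
print(ocr_url("7e00155db23e480d67609b59", extract_medical_info=True))
```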

If you have active subscriptions that can receive operation results, you’ll get an HTTP 200 OK response that contains a list of the active subscriptions.

{
    "resourceType": "Parameters",
    "parameter": [
        {
            "name": "subscription",
            "valueId": "Subscription/2913165d0f881b6a0bd35afd"
        }
    ]
}

If you have no active subscriptions, you’ll get an HTTP 405 Method Not Allowed response.
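
A client can tell these two outcomes apart by the status code and, on success, read the subscription references from the Parameters resource. The following is a minimal sketch under the assumption that the status code and response body are already available from whatever HTTP client is in use. NoActiveSubscriptionError and subscriptions_from_response are hypothetical names.

```python
import json
from typing import List


class NoActiveSubscriptionError(Exception):
    """Raised when the $ocr operation is rejected with HTTP 405."""


def subscriptions_from_response(status: int, body: str) -> List[str]:
    """Return the subscription references listed in a successful $ocr response."""
    if status == 405:
        raise NoActiveSubscriptionError(
            "No active subscription; create one before starting the NLP job"
        )
    if status != 200:
        raise RuntimeError(f"Unexpected HTTP status {status}")
    resource = json.loads(body)
    # Each active subscription appears as a parameter named "subscription".
    return [
        param["valueId"]
        for param in resource.get("parameter", [])
        if param.get("name") == "subscription" and "valueId" in param
    ]
```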

Receive the Results

Once the operation completes, you will receive an HTTP POST request at your webhook endpoint.

The request body is in JSON format and consists of the following attributes:

| Attribute | Type | Description |
|---|---|---|
| resource | String | DocumentReference |
| id | String | The ID of the resource |
| ocr | Array | List of OCR objects |
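
To make the receiving side concrete, here is a minimal webhook sketch built on Python's standard http.server module. It assumes notifications arrive as HTTP POST (as stated in the Create Subscription section) and that the Authorization header registered in channel.header is echoed back with each notification. EXPECTED_AUTH, parse_notification, WebhookHandler, serve, and port 8080 are hypothetical choices, and a production endpoint would sit behind TLS.

```python
import hmac
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

EXPECTED_AUTH = "Bearer your-secret-token"  # value registered in channel.header


def parse_notification(auth_header, body):
    """Return (http_status, payload) for an incoming notification."""
    # Constant-time comparison of the shared secret from the Subscription.
    supplied = (auth_header or "").encode("utf-8")
    if not hmac.compare_digest(supplied, EXPECTED_AUTH.encode("utf-8")):
        return 401, None
    try:
        payload = json.loads(body)
    except ValueError:
        return 400, None
    if (
        not isinstance(payload, dict)
        or payload.get("resource") != "DocumentReference"
        or not isinstance(payload.get("ocr"), list)
    ):
        return 400, None
    return 200, payload


class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        status, payload = parse_notification(
            self.headers.get("Authorization"), self.rfile.read(length)
        )
        if payload is not None:
            print("OCR result received for document", payload.get("id"))
        self.send_response(status)
        self.end_headers()


def serve(port=8080):
    """Start a blocking server; call this explicitly when running the sketch."""
    HTTPServer(("0.0.0.0", port), WebhookHandler).serve_forever()
```

Keeping the validation in parse_notification, separate from the handler class, makes it possible to test the logic without opening a socket.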

OCR Object

| Attribute | Type | Description |
|---|---|---|
| name | String | The name of the document |
| contentType | String | Example: application/pdf |
| success | Boolean | True if the document was processed, false otherwise |
| error_code | String | The error code in case of failure |
| pages | Array | List of Page objects that contain the text extracted from the document |
| entities | Array | List of Entity objects |
| unmappedAttributes | Array | List of Attribute objects that were extracted but not mapped to an entity |
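
Because each OCR object reports its own outcome, a consumer would typically separate processed documents from failed ones before reading pages or entities. The sketch below is one way to do that. split_results is a hypothetical name, and the code accepts both success and Success because the example payload later on this page capitalises the flag while this table does not.

```python
def split_results(payload):
    """Return (processed, failed); failed holds (name, error_code) pairs."""
    processed, failed = [], []
    for item in payload.get("ocr", []):
        # Accept either spelling of the flag; treat a missing flag as failure.
        ok = item.get("success", item.get("Success", False))
        if ok:
            processed.append(item)
        else:
            failed.append((item.get("name"), item.get("error_code")))
    return processed, failed
```

For a payload with one processed and one failed document, processed contains the first OCR object and failed contains the name and error code of the second.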

Page Object

| Attribute | Type | Description |
|---|---|---|
| number | String | Page number |
| lines | Array | List of Line objects found on the page |

Line Object

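| Attribute | Type | Description |
|---|---|---|
| confidence | Float | The confidence score for the accuracy of the recognized text. Minimum value of 0. Maximum value of 100. |
| text | String | The recognized text of the line |
| page | Number | The number of the page that contains the line |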
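
Since the recognized text arrives as individual lines with a confidence score between 0 and 100, reassembling a document usually means walking pages and lines in order and optionally dropping low-confidence lines. A minimal sketch follows. document_text and the min_confidence threshold are hypothetical, and the right threshold depends on your documents.

```python
def document_text(ocr_item, min_confidence=0.0):
    """Join the recognized lines of one OCR object, page by page."""
    pages = []
    for page in ocr_item.get("pages", []):
        kept = [
            line["text"]
            for line in page.get("lines", [])
            if line.get("confidence", 0.0) >= min_confidence
        ]
        pages.append("\n".join(kept))
    # Separate pages with a blank line.
    return "\n\n".join(pages)
```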

Entity Object

An Entity provides information about an extracted medical term.

| Attribute | Type | Description |
|---|---|---|
| Id | Integer | Numeric identifier |
| Category | String | Category of the entity; the currently supported categories are listed below |
| Attributes | Array | List of Attribute objects that relate to this entity; the contents depend on the entity category |
| BeginOffset | Integer | 0-based offset in the input text that shows where the entity starts |
| EndOffset | Integer | 0-based offset in the input text that shows where the entity ends |
| Score | Float | The level of confidence in the accuracy of the detection |
| Text | String | Segment of the input text extracted as the entity |
| Traits | Array | List of Trait objects |

The following categories are supported at the moment:

  • MEDICATION
  • MEDICAL_CONDITION
  • PROTECTED_HEALTH_INFORMATION
  • TEST_TREATMENT_PROCEDURE
  • ANATOMY

Attribute Object

An extracted segment of the text that is an attribute of an entity, or is otherwise related to an entity, such as the dosage of a medication taken. It contains information about the attribute, such as its ID, its begin and end offsets within the input text, and the matching segment of the input text.

| Attribute | Type | Description |
|---|---|---|
| Id | Integer | Numeric identifier |
| BeginOffset | Integer | 0-based offset in the input text that shows where the attribute starts |
| EndOffset | Integer | 0-based offset in the input text that shows where the attribute ends |
| Score | Float | The level of confidence in the accuracy of the detection |
| RelationshipScore | Float | The level of confidence that the attribute relates to the entity |
| Text | String | Segment of the input text extracted as the attribute |
| Traits | Array | List of Trait objects |
| Type | String | Specific type of entity; the currently supported types are listed below |

Currently the following types are supported:

  • NAME
  • DOSAGE
  • FORM
  • FREQUENCY
  • DURATION
  • GENERIC_NAME
  • BRAND_NAME
  • STRENGTH
  • RATE
  • TEST_NAME
  • TEST_UNITS
  • PROCEDURE_NAME
  • TREATMENT_NAME
  • DATE
  • AGE
  • CONTACT_POINT
  • EMAIL
  • IDENTIFIER
  • URL
  • ADDRESS
  • SYSTEM_ORGAN_SITE
  • DIRECTION
  • QUALITY
  • QUANTITY

Trait Object

Provides contextual information about the extracted entity.

| Attribute | Type | Description |
|---|---|---|
| Name | String | Name or contextual description of the trait |
| Score | Float | The level of confidence in the accuracy of the detection |
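
Traits are most useful as filters, for instance to keep only the entities that were detected in a given context. The sketch below assumes the lowercase key names used in the example payload on this page (traits, name, score) rather than the capitalised names in the tables. entities_with_trait and the 0.5 default threshold are hypothetical.

```python
def entities_with_trait(entities, trait_name, min_score=0.5):
    """Return the entities that carry trait_name with at least min_score confidence."""
    return [
        entity
        for entity in entities
        if any(
            trait.get("name") == trait_name and trait.get("score", 0.0) >= min_score
            for trait in entity.get("traits", [])
        )
    ]
```

Entities without a traits array are simply skipped, so the helper can be applied to a mixed list.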

Example

{  
   "resource":"DocumentReference",
   "id":"7e00155db23e480d67609b59",
   "ocr":[  
      {  
         "name":"laboratory_result.pdf",
         "contentType":"application/pdf",
         "Success":true,
         "pages":[  
            {  
               "number":1,
               "lines":[  
                  {  
                     "confidence":55.2390022277832,
                     "text":"From",
                     "page":1
                  },
                  {  
                     "confidence":93.70309448242188,
                     "text":"Mon 05 Nov 2012 12:10:19 PM PST",
                     "page":1
                  },
                  ...
               ]
            },
            ...
         ],
         "entities":[  
            {  
               "resource":"Entity",
               "id":0,
               "category":"PROTECTED_HEALTH_INFORMATION",
               "type":"DATE",
               "text":"Mon 05 Nov 2012",
               "score":0.99408334
            },
            {  
               "resource":"Entity",
               "id":52,
               "category":"MEDICAL_CONDITION",
               "type":"DX_NAME",
               "text":"venous thrombosis",
               "score":0.9674222,
               "traits":[  
                  {  
                     "resource":"Trait",
                     "name":"DIAGNOSIS",
                     "score":0.9529462
                  }
               ]
            },
            ...
         ]
      }
   ]
}
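
A payload shaped like the example above can be summarised by grouping the extracted entity texts by category. This is a minimal sketch that relies on the lowercase keys shown in the example (ocr, entities, category, text, score). entities_by_category and min_score are hypothetical names.

```python
from collections import defaultdict


def entities_by_category(payload, min_score=0.0):
    """Map each entity category to the list of entity texts found in the payload."""
    grouped = defaultdict(list)
    for item in payload.get("ocr", []):
        for entity in item.get("entities", []):
            if entity.get("score", 0.0) >= min_score:
                grouped[entity["category"]].append(entity["text"])
    return dict(grouped)
```

Raising min_score trades recall for precision, which may matter when the results feed an automated workflow.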

Limits

The Health Gorilla NLP API has the following restrictions:

  • Only PDF, JPEG, and PNG documents can be processed.
  • The maximum image (JPEG/PNG) size is 5 MB.
  • The maximum PDF file size is 500 MB.
  • The maximum number of pages in a PDF file is 3000.
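
These limits can be checked on the client before a document is uploaded, which avoids a round trip for files that would be rejected anyway. The sketch below makes several assumptions: the file type is inferred from the extension, 1 MB is taken to be 1024 × 1024 bytes, and the page count is supplied by the caller because counting pages needs a PDF library. check_limits is a hypothetical name.

```python
import os

MB = 1024 * 1024  # assumption: the limits are counted in binary megabytes
MAX_BYTES_BY_EXTENSION = {
    ".pdf": 500 * MB,
    ".jpg": 5 * MB,
    ".jpeg": 5 * MB,
    ".png": 5 * MB,
}
MAX_PDF_PAGES = 3000


def check_limits(filename, size_bytes, pdf_pages=None):
    """Return a list of problems; an empty list means the file passes the checks."""
    ext = os.path.splitext(filename)[1].lower()
    limit = MAX_BYTES_BY_EXTENSION.get(ext)
    if limit is None:
        return [f"unsupported file type: {ext or '(none)'}"]
    problems = []
    if size_bytes > limit:
        problems.append(f"file is {size_bytes} bytes; the limit is {limit}")
    # The page limit is only checked when the caller provides a page count.
    if ext == ".pdf" and pdf_pages is not None and pdf_pages > MAX_PDF_PAGES:
        problems.append(f"PDF has {pdf_pages} pages; the limit is {MAX_PDF_PAGES}")
    return problems
```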