Classification: how does it actually work?

In one of the previous "What’s new in GScan Service" blog posts, we introduced a new feature called Classification. This feature sorts documents automatically based on their “full-text fingerprint”. Once the documents are separated into classification categories, each one can be processed by the scan job assigned to its category. This all sounds nice, but how does it actually work? Let’s walk through an example.

3. 8. 2020 Tomáš Jurik

For our example, we created a simple scenario: an HR office with an ongoing hiring process. In the current phase, the HR office receives CVs, completed job application forms, and background check authorizations from candidates via email. A SharePoint library with three content types (CV, ApplicationForm, and BackgroundCheckAuthorization) has been created for these documents. Our objective is to sort the documents, extract the required metadata, and export them as follows:

Resume (CV) – store the document into the SharePoint library as its own content type. Required metadata: First name, Surname, Email, Phone number.

Job application form – store the document into the same SharePoint library as its own content type. The metadata is also exported into an internal database. Required metadata: First name, Surname, Position, Email, Phone number.

Background check authorization – store the document into the same SharePoint library as its own content type and send it via email to the legal department.
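To make the scenario easier to follow, the three document types can be written down as a small data structure. This is only a sketch of the requirements listed above: the class, the export labels ("sharepoint", "database", "email"), and the helper function are hypothetical names chosen for illustration and are not part of GScan Service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DocumentType:
    """One document type from the HR scenario (illustrative, not a GScan API)."""
    content_type: str                   # SharePoint content type named in the post
    required_metadata: tuple[str, ...]  # fields that must be extracted
    exports: tuple[str, ...]            # hypothetical labels for export targets


HR_DOCS = {
    "CV": DocumentType(
        "CV",
        ("First name", "Surname", "Email", "Phone number"),
        ("sharepoint",),
    ),
    "ApplicationForm": DocumentType(
        "ApplicationForm",
        ("First name", "Surname", "Position", "Email", "Phone number"),
        ("sharepoint", "database"),
    ),
    "BackgroundCheckAuthorization": DocumentType(
        "BackgroundCheckAuthorization",
        (),  # the post lists no required fields for this type
        ("sharepoint", "email"),
    ),
}


def missing_metadata(category: str, extracted: dict[str, str]) -> list[str]:
    """Return the required fields that are absent or empty for a category."""
    required = HR_DOCS[category].required_metadata
    return [field for field in required if not extracted.get(field)]
```

For example, `missing_metadata("CV", {"First name": "Jana"})` would report the three fields that still need to be indexed before the resume is exported.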

The following diagram shows how we will build our example solution:


First, we created the classification categories shown in the image below. HR_docs is the main category. The subcategories are named after the document types the office is going to process.

We also “taught” the system what each document type looks like by clicking the Upload documents for learning button and choosing a sample document from the file system.
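This post does not describe how GScan Service builds its fingerprints internally. To give a rough intuition for learning from sample documents, here is a minimal sketch that swaps in a simple, well-known technique: a bag-of-words term count per category, compared by cosine similarity. All function names are hypothetical, and the real product may work quite differently.

```python
import math
import re
from collections import Counter


def fingerprint(texts: list[str]) -> Counter:
    """Build a term-frequency vector from the sample texts of one category."""
    counts: Counter = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z]+", text.lower()))
    return counts


def similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors (0.0 to 1.0)."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(text: str, fingerprints: dict[str, Counter]) -> tuple[str, float]:
    """Return the category whose fingerprint is most similar to the text."""
    doc = fingerprint([text])
    scored = ((name, similarity(doc, fp)) for name, fp in fingerprints.items())
    return max(scored, key=lambda pair: pair[1])
```

In this toy version, "teaching" a category simply means calling `fingerprint` on the text of its sample documents and storing the result under the category name.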

Once the classification category structure was established, we built a scan job for each document type, because each type needs to be processed slightly differently.

For resumes, the scan job loads the document and starts processing. Because resume formatting and layouts vary so widely, the metadata is extracted manually using the Click-to-index feature. Once the data is extracted, the document is exported into the prepared SharePoint library using the SharePoint plugin. We also decided to put the resumes into a subfolder named after their classification category.

Job application forms and background check authorization forms are processed in almost the same way as resumes, with a few differences. In the application form scan job, the DB plugin exports the metadata into an internal database. In the background check authorization scan job, the Email plugin sends the document with its extracted metadata to the legal department mailbox.

When all three scan jobs were completed, we could move on to creating the classification job. Each category was assigned its respective scan job, which ensures that every classified document is sent to the scan job bound to its category.

Usage demonstration:

The jobs are published and loaded, processing is started, and we can look at how it works in practice.

An email with attached documents is received from a candidate. The Email connector extracts the attachments and copies them into the import folder, which is monitored by GScan Service.

GScan Service picks up any valid document from the import folder and sends it to the classification job.

The classification job takes the documents and compares them to the “full-text fingerprints” that were created from the sample documents uploaded during preparation. It then offers the best match during verification. If the match reaches the threshold defined in the classification job, the document name is shown in green and we can confirm the batch (if verification is turned off, the batch is confirmed automatically). If the threshold is not reached, the document name is shown in red and the user has to verify whether the category is correct (if verification is turned off, the batch ends up in a verification error and must be verified manually). If the category is correct, the batch can be confirmed; if it is not, we can choose the correct category from a dropdown menu. In that case the document name turns blue, which means the category was changed and the system will record which category the document belongs in.
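The color and confirmation rules described above can be summarized as a small decision function. This is a sketch of the behavior as we described it, not GScan Service code: the function name, the action labels, and the choice of "greater than or equal" for "reaches the threshold" are assumptions made for illustration.

```python
def classification_outcome(
    score: float,
    threshold: float,
    verification_enabled: bool,
    category_changed: bool = False,
) -> tuple[str, str]:
    """Return (name color, next action) for one classified document.

    The action labels are hypothetical; they only mirror the prose above.
    """
    if category_changed:
        # The user picked another category from the dropdown menu.
        return ("blue", "confirm_with_new_category")

    if score >= threshold:  # assumption: "reaches" means >=
        if verification_enabled:
            return ("green", "user_confirms_batch")
        return ("green", "auto_confirm")

    if verification_enabled:
        return ("red", "user_verifies_category")
    # With verification turned off, a weak match cannot be auto-confirmed.
    return ("red", "verification_error")
```

A document scoring 0.65 against a threshold of 0.8 with verification turned off would therefore end in `("red", "verification_error")` and wait for manual verification.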

After the batch is confirmed in the verification of classification, each document is sent to a scan job that is assigned to its category and is processed accordingly.

Note: In case there is more than one document in the batch, the documents will be separated and sent to their respective scan jobs individually as separate batches.
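The routing and splitting step can be pictured as follows. This is a minimal sketch under stated assumptions: the scan job names and the `split_and_route` function are hypothetical, and a confirmed batch is modeled simply as a list of (document name, category) pairs.

```python
# Hypothetical scan job names; in GScan Service the binding is configured
# in the classification job, not in code.
CATEGORY_TO_SCAN_JOB = {
    "CV": "ResumeScanJob",
    "ApplicationForm": "ApplicationFormScanJob",
    "BackgroundCheckAuthorization": "BackgroundCheckScanJob",
}


def split_and_route(batch: list[tuple[str, str]]) -> list[tuple[str, list[str]]]:
    """Turn one confirmed batch into single-document batches per scan job.

    Each input item is (document name, confirmed category). Each output
    item is (scan job name, [document name]), one per document.
    """
    routed = []
    for document, category in batch:
        scan_job = CATEGORY_TO_SCAN_JOB[category]  # KeyError if unbound
        routed.append((scan_job, [document]))
    return routed
```

Given a batch with one resume and one application form from the same email, the function returns two separate one-document batches, each addressed to its own scan job.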

