Regular Expression Entity Type
Overview
A regular expression is a pattern used to identify text. It allows you to have very fine-grained control over what content DryvIQ detects. The pattern must be constructed according to regular expression standards. There are many online resources that explain how to construct a regular expression.
When creating an entity in DryvIQ using a regular expression, you can add one or many patterns to ensure the entity type matches exactly what you want to find. You will specify both the pattern and the confidence level for each pattern. You can further improve match accuracy by adding keywords and validation to the entity type.
Entity Type Details
Editing the Entity type details lets you enter a description for the entity type you are creating. The description is separate from the name used to search for the entity type in the application. It helps your users understand what the entity type aims to accomplish with the rules and validation you select. The description is limited to 256 characters.
Category
Editing the Entity type details also lets you change the category assigned to the entity type. The category identifies the type of data being detected. The Category list includes the five default categories that ship with the application (Financial, General, Privacy, Regulation, and Technology), as well as any custom categories you have created. Preinstalled entity types are assigned to their corresponding categories. All custom entity types default to “General,” so you must edit the category if an entity type needs a specific one. (See Managing Categories for information about creating and managing custom categories.)
Regex
This section is where you will build your regular expression pattern and assign a confidence level.
Description
The Description is a user-defined name for the pattern you are going to use. This helps identify the pattern. While this is an optional field for the regular expression patterns you add, DryvIQ recommends adding a description, as it makes it easier for other users to understand the pattern when reviewing the information.
Regex Pattern
The Regex pattern is the regular expression pattern you want to use for the entity type. Again, the pattern must be constructed according to regular expression standards.
The timeout for matching regex patterns is 5 minutes per pattern in an entity type. If a match takes longer than the timeout, an error will be logged in the policy's Activity report, indicating that the regex engine timed out.
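As a rough illustration of what a pattern looks like, the sketch below tests a regular expression outside DryvIQ using Python's standard `re` module. The serial-number format and the `find_serials` helper are hypothetical, and the regex flavor DryvIQ uses may differ from Python's in some details, so treat this only as a way to sanity-check a pattern before entering it in the Regex pattern field.

```python
import re
from typing import List

# Hypothetical serial-number format, chosen only for illustration:
# two uppercase letters, a hyphen, then six digits (for example "AB-123456").
SERIAL_PATTERN = re.compile(r"\b[A-Z]{2}-\d{6}\b")


def find_serials(text: str) -> List[str]:
    """Return every substring of text that matches the pattern."""
    return SERIAL_PATTERN.findall(text)


sample = "Unit AB-123456 shipped; ab-123456 and AB-12345 are not serials."
print(find_serials(sample))  # ['AB-123456']
```

Anchoring the pattern with word boundaries (`\b`) and fixed repetition counts keeps it specific, which also tends to keep matching fast and well within the timeout.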
Confidence
The confidence level lets you control how many false positives you are willing to tolerate. The Confidence list displays the available levels. Each confidence level corresponds to a threshold (or a probability in machine-learning-based models) used throughout the rest of the entity-type model.
The confidence level mapping is as follows:
None = 0
Very weak = 0.05
Weak = 0.3
Medium = 0.5
Strong = 0.7
Very strong = 0.85
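The mapping above can be written as a simple lookup table. The sketch below is for illustration only: the `CONFIDENCE_LEVELS` name and the `level_for` helper are hypothetical and do not describe how DryvIQ stores or compares these values internally.

```python
# Named confidence levels and their numeric values, as listed above.
CONFIDENCE_LEVELS = {
    "None": 0.0,
    "Very weak": 0.05,
    "Weak": 0.3,
    "Medium": 0.5,
    "Strong": 0.7,
    "Very strong": 0.85,
}


def level_for(score: float) -> str:
    """Return the highest named level whose value does not exceed score."""
    best = "None"
    # The dictionary is written in ascending order, so the last level
    # that fits is the highest one.
    for name, value in CONFIDENCE_LEVELS.items():
        if value <= score:
            best = name
    return best


print(level_for(0.6))  # Medium
```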
Select Add regex pattern to add additional patterns. You can add as many patterns to the entity type as you like to help strengthen the match.
Validations
Like keywords, validations help improve match success. DryvIQ comes with preinstalled validation rules that verify social security numbers, checksums, driver’s license numbers, and more. You can select validation rules from the list, and DryvIQ will run all matches against them. Any match that fails validation is automatically filtered from the result list. For example, if you add a pattern to detect credit card numbers, you should also enable Luhn Check validation to ensure the matches are valid credit card numbers. This extra validation reduces the number of false-positive matches you need to sort through. As with keywords, adding validations increases match success by the specified percentage.
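To show what a checksum validation does, here is a minimal sketch of the standard Luhn algorithm in Python. It is a generic implementation of the published algorithm, not DryvIQ's own code, and the `luhn_valid` name is hypothetical. The sample numbers are common test values, not real card numbers.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in number pass the Luhn checksum."""
    # Ignore spaces, hyphens, and any other non-digit separators.
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk right to left, doubling every second digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


candidates = ["4111 1111 1111 1111", "4111 1111 1111 1112"]
# Keep only the regex matches that also pass validation.
print([c for c in candidates if luhn_valid(c)])  # ['4111 1111 1111 1111']
```

A regex alone accepts any 16-digit string with the right shape; the checksum step is what removes the strings that cannot be card numbers.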
Keyword Proximity
You can further improve match accuracy by providing a list of keywords that may appear near the entity you want to identify. By default, a term is considered in close proximity if it appears within 5 words before the match. These keywords boost the confidence level of a given match. For example, the confidence level for a serial number pattern may be only 0.5 (medium); however, adding keywords increases it by 35% (as indicated by the green percentage on the right of this section).
Keywords
Keywords can be added manually to the Keywords field or imported from a CSV file. When manually adding keywords, you can enter the terms as comma-separated values, or you can add each keyword on a new line.
Keywords cannot contain the following characters: ~ ` ! @ # $ % ^ & * ( ) = { } [ ] | \ : ; " ' < > ? . /
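If you prepare keyword lists outside the application, a small pre-check can catch disallowed characters before you paste or import them. The sketch below assumes the input uses commas or new lines as separators, as described above. The `parse_keywords` helper is hypothetical and is not part of DryvIQ.

```python
import re
from typing import List

# The characters listed above as not allowed in keywords.
FORBIDDEN = set("~`!@#$%^&*()={}[]|\\:;\"'<>?./")


def parse_keywords(raw: str) -> List[str]:
    """Split raw text on commas or new lines and reject disallowed characters."""
    terms = [term.strip() for term in re.split(r"[,\n]", raw)]
    terms = [term for term in terms if term]  # drop empty entries
    for term in terms:
        bad = sorted(set(term) & FORBIDDEN)
        if bad:
            raise ValueError(f"Keyword {term!r} contains disallowed characters: {bad}")
    return terms


print(parse_keywords("serial, device id\nasset tag"))
# ['serial', 'device id', 'asset tag']
```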
See Managing Keywords for information about managing keywords added to the entity type.
Maximum Distance From Match
These fields allow you to set a custom keyword proximity. By default, a keyword within 5 words before the text matched by the regular expression pattern counts as being in proximity. You can edit the field to specify the distance you prefer. You can also turn on proximity to search for keywords after the regular expression pattern and specify the value you want to use.
Clearing the checkbox for the words-before or words-after proximity field disables the proximity search in that direction. You should not disable both fields, as doing so disables keyword matching.
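The proximity rule can be sketched in Python as follows. This is an approximation under stated assumptions: words are split on whitespace, only single-word keywords are handled, and the function name and parameters are hypothetical. DryvIQ's actual tokenization and matching logic may differ.

```python
import re
from typing import Iterable, List, Tuple

WORD = re.compile(r"\S+")


def matches_with_keyword_flag(
    text: str,
    pattern: str,
    keywords: Iterable[str],
    words_before: int = 5,
    words_after: int = 0,
) -> List[Tuple[str, bool]]:
    """Pair each regex match with whether a keyword appears within range."""
    wanted = {k.lower() for k in keywords}
    results = []
    for m in re.finditer(pattern, text):
        before = WORD.findall(text[: m.start()])
        after = WORD.findall(text[m.end():])
        # A distance of 0 disables the search in that direction.
        window = before[-words_before:] if words_before > 0 else []
        window = window + (after[:words_after] if words_after > 0 else [])
        normalized = {w.strip(".,;:!?()\"'").lower() for w in window}
        results.append((m.group(), bool(wanted & normalized)))
    return results


text = "Serial number: AB-123456 was logged. Later we saw CD-654321 in a note."
print(matches_with_keyword_flag(text, r"\b[A-Z]{2}-\d{6}\b", ["serial"]))
# [('AB-123456', True), ('CD-654321', False)]
```

With both distances set to 0, no keyword can ever be found, which mirrors the warning above about disabling both fields.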
Negative Keyword Proximity
The Negative keyword list is an explicit list of words or phrases that should prevent a match when detected in the vicinity of the regular expression pattern, helping reduce false positives. For an upload against an entity type, the match confidence will be 0% if a negative keyword is found. For a policy, the presence of a negative keyword, even if other validation and keywords are present, will prevent the item from being matched and assigned to the corresponding tracking group.
Negative Keywords
Negative keywords can be manually added to the Negative keywords field or imported using a CSV file. When manually adding keywords, you can enter the terms as comma-separated values, or you can add each keyword on a new line.
Maximum Distance From Match
These fields allow you to set a custom negative keyword proximity. By default, a negative keyword within 5 words before the regular expression pattern will trigger a match confidence adjustment. You can edit the field to specify the distance you prefer. You can also turn on proximity to search for negative keywords after the regular expression pattern and specify the value you want to use. Clearing the checkbox for the words-before or words-after proximity field disables the proximity search in that direction. You should not disable both fields, as doing so disables negative keyword matching.
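The effect of a negative keyword on a single match can be sketched as a simple override. The example assumes the words within the configured distance of the match have already been collected. The `apply_negative_keywords` helper is hypothetical and only illustrates the rule that a nearby negative keyword reduces the match confidence to 0.

```python
from typing import Iterable


def apply_negative_keywords(
    confidence: float,
    words_near_match: Iterable[str],
    negative_keywords: Iterable[str],
) -> float:
    """Return 0.0 if any negative keyword is near the match, else confidence."""
    nearby = {w.lower() for w in words_near_match}
    blocked = any(k.lower() in nearby for k in negative_keywords)
    return 0.0 if blocked else confidence


# "sample" is near the match, so the match is suppressed.
print(apply_negative_keywords(0.85, ["sample", "serial"], ["sample", "test"]))  # 0.0
# No negative keyword nearby, so the confidence is unchanged.
print(apply_negative_keywords(0.85, ["serial", "number"], ["sample", "test"]))  # 0.85
```

The override ignores how strong the match was otherwise, which matches the behavior described above: other keywords and validations do not rescue a match once a negative keyword is present.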
