| RESPITE: The CASA Toolkit Page: Documentation: Block Library Index:ColourMask |
The ColourMask block performs a `pseudo-grouping' on a discrete 1/0 missing data mask to form a set of numbered groups. The output is a time-frequency map in which each point carries an integer label indicating the group to which it has been assigned. This map may be used as input to the multisource decoder.
The `pseudo-grouping' is based on locating and labelling contiguous regions of 1's in the missing data mask. Time-frequency points are considered contiguous if they are adjacent in either time or frequency (i.e. joined either `horizontally' or `vertically' but not `diagonally' in the 2-D time-frequency map). Each separated region is assigned a unique group number. As an option the missing data mask may be split into sub-bands which are treated independently when searching for contiguous regions.
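The contiguity rule above can be illustrated with a minimal offline sketch. It assumes the whole mask is available as a list of frames (`mask[t][f]`), uses 0 for unlabelled points and numbers groups from 1; the function name and the numbering scheme are illustrative assumptions, not the block's actual implementation, which works online on a window of frames.

```python
from collections import deque


def label_regions(mask):
    """Label 4-connected regions of 1s in mask[t][f] (offline sketch)."""
    n_frames = len(mask)
    n_chans = len(mask[0]) if mask else 0
    labels = [[0] * n_chans for _ in range(n_frames)]
    next_label = 1  # assumed numbering; 0 means "not in any group"
    for t in range(n_frames):
        for f in range(n_chans):
            if mask[t][f] != 1 or labels[t][f] != 0:
                continue
            # Flood-fill a new region starting from this unlabelled point.
            labels[t][f] = next_label
            queue = deque([(t, f)])
            while queue:
                ct, cf = queue.popleft()
                # Neighbours in time and frequency only, never diagonal.
                for nt, nf in ((ct - 1, cf), (ct + 1, cf), (ct, cf - 1), (ct, cf + 1)):
                    if (0 <= nt < n_frames and 0 <= nf < n_chans
                            and mask[nt][nf] == 1 and labels[nt][nf] == 0):
                        labels[nt][nf] = next_label
                        queue.append((nt, nf))
            next_label += 1
    return labels
```

In this sketch two points that touch only diagonally end up in different groups, matching the rule described above.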
The block may also perform additional group splitting based on optional pitch and voicing estimate inputs. If the voicing input is connected then groups will be split at frames where the voicing parameter crosses the VOICING_THRESHOLD (i.e. at such frames all existing groups will end and new groups will begin). If the pitch input is supplied then groups will also be split at voiced frames in which the pitch changes by more than the DELTA_PITCH_THRESHOLD. (If the voicing input is not connected then all frames will be treated as though they are voiced).
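As a rough sketch of the splitting rule, the following function returns the frame indices at which groups would be split. It assumes that a frame counts as voiced when its voicing value is strictly above the threshold and that pitch change is measured against the previous frame; the source does not state either detail, and the function name and signature are hypothetical.

```python
def split_frames(n_frames, voicing=None, pitch=None,
                 voicing_threshold=0.5, delta_pitch_threshold=10.0):
    """Return frame indices where existing groups end and new ones begin."""
    splits = []
    for t in range(1, n_frames):
        # With no voicing input, every frame is treated as voiced.
        was_voiced = True if voicing is None else voicing[t - 1] > voicing_threshold
        is_voiced = True if voicing is None else voicing[t] > voicing_threshold
        if was_voiced != is_voiced:
            # The voicing parameter crossed the threshold.
            splits.append(t)
        elif (pitch is not None and is_voiced
              and abs(pitch[t] - pitch[t - 1]) > delta_pitch_threshold):
            # Pitch jumped by more than the allowed amount in a voiced frame.
            splits.append(t)
    return splits
```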
After the contiguous region grouping, a higher level of common onset/common offset grouping may be applied to the groups that have been located. In this case disconnected groups that start (or end) at the same time frame are merged into a common group.
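Common onset merging can be sketched offline as follows, assuming a complete label map `labels[t][f]` with 0 for unlabelled points. Groups whose first frame coincides are relabelled with the smallest label among them; that tie-breaking choice and the function name are assumptions made for illustration.

```python
def merge_common_onsets(labels):
    """Merge groups that first appear in the same frame (offline sketch)."""
    onset = {}
    for t, frame in enumerate(labels):
        for lab in frame:
            if lab != 0 and lab not in onset:
                onset[lab] = t  # first frame in which this group appears
    first_at = {}  # onset frame -> label that all groups starting there take
    remap = {}
    for lab in sorted(onset):
        remap[lab] = first_at.setdefault(onset[lab], lab)
    return [[remap.get(lab, 0) for lab in frame] for frame in labels]
```

Common offset merging would follow the same pattern using the last frame of each group, subject to the restriction described under offset grouping.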
The grouping algorithm is controlled by the following 9 parameters:
An exact labelling of each separate contiguous region of 1's in the 1/0 mask is only possible if the entire mask is visible. However, CTK blocks are designed to run in an `online' mode where data output `keeps up with' data input, i.e. the colouring block does not wait until the end of the utterance to produce its result. The algorithm therefore operates on a window of the data (WINDOW_SIZE frames, by default 5), and the grouping decisions and output lag behind the input by the size of this window. Because the algorithm cannot look ahead further than the size of this window, errors can occur if two groups that appear separate in the window merge at a later time. Increasing the size of the window makes these errors less likely at the expense of a greater lag in the output.
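The lag introduced by the window can be pictured with a small buffering sketch. It models only the delay, not the grouping itself, and assumes a frame is released once `window_size` later frames have arrived; the generator name is hypothetical.

```python
from collections import deque


def lagged_output(frames, window_size=5):
    """Yield (input index at emission, frame), delayed by window_size frames."""
    window = deque()
    t_in = -1
    for t_in, frame in enumerate(frames):
        window.append(frame)
        if len(window) > window_size:
            # The oldest frame leaves the window; its labels are now final.
            yield t_in, window.popleft()
    while window:
        # End of input: flush whatever is still held in the window.
        yield t_in, window.popleft()
```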
If ONSET_GROUPING is set to TRUE then common onset grouping is applied to merge separate groups that happen to start at the same time frame.
If OFFSET_GROUPING is set to TRUE then common offset grouping is applied to merge separate groups that happen to end at the same time frame.
Offset grouping is a little more complicated than onset grouping because it is by nature retroactive: the decision to merge two groups cannot be made until both groups have ended. By the time that decision is made, data frames containing the beginning of the groups may already have passed through the analysis window and out of the block. Once part of a group has been passed out of the colour block its label cannot be changed. So, in the current implementation, offset grouping is only applied if at the point of offset at least one of the two groups has an onset within the analysis window (i.e. the group is fully contained within the window).
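The restriction on offset grouping amounts to a simple test on onset times. The sketch below assumes the analysis window covers the `window_size` frames ending at the offset frame; the exact window boundaries and the function name are assumptions, not taken from the implementation.

```python
def can_merge_on_offset(onset_a, onset_b, offset_frame, window_size=5):
    """True if at least one group started inside the current window."""
    oldest_in_window = offset_frame - window_size + 1  # assumed window extent
    return onset_a >= oldest_in_window or onset_b >= oldest_in_window
```

For example, with a 5-frame window and a common offset at frame 10, a group that began at frame 8 can still be relabelled, whereas two groups that began at frames 0 and 2 cannot be merged.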
Missing data masks typically contain `speckled' regions, which the grouping algorithm interprets as large numbers of groups, each containing very few points. This proliferation of groups can cause performance problems during multisource decoding. To avoid this, groups containing fewer than MIN_GROUP_SIZE time-frequency points are automatically merged into a common group, which is always given the label `1'.
Note, if MIN_GROUP_SIZE is 1 or less then the feature has no effect.
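A minimal offline sketch of this merging step is given below. It assumes a complete label map in which 0 marks unlabelled points and ordinary groups use labels of 2 or more, so that label 1 is free for the merged group; that numbering and the function name are assumptions.

```python
from collections import Counter


def merge_small_groups(labels, min_group_size):
    """Relabel every group with fewer than min_group_size points as group 1."""
    sizes = Counter(lab for frame in labels for lab in frame if lab != 0)
    small = {lab for lab, n in sizes.items() if n < min_group_size}
    return [[1 if lab in small else lab for lab in frame] for frame in labels]
```

With `min_group_size` of 1 or less no group can be smaller than the threshold, so the map is returned unchanged, in line with the note above.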
If MAX_GROUP_SIZE is set to a non-zero value then groups are allowed to grow from frame to frame only up to a maximum size of MAX_GROUP_SIZE points. At the first frame where MAX_GROUP_SIZE is exceeded the group is terminated and a new group is started (i.e. large groups are sliced up into short segments).
The feature is disabled when MAX_GROUP_SIZE is set to 0.
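The slicing behaviour can be sketched for a single group as follows. The input is the number of points the group contributes in each frame, and the output is a segment index per frame. The sketch assumes that the frame which would push the size over the limit becomes the first frame of the new segment; the source does not say which side of the cut that frame falls on, and the function name is hypothetical.

```python
def slice_group(points_per_frame, max_group_size):
    """Return a segment index per frame for one group (0 disables slicing)."""
    if max_group_size == 0:
        return [0] * len(points_per_frame)
    segments = []
    segment, size = 0, 0
    for n in points_per_frame:
        if size > 0 and size + n > max_group_size:
            # Adding this frame would exceed the limit: start a new segment.
            segment += 1
            size = 0
        size += n
        segments.append(segment)
    return segments
```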
If NUM_SUBBANDS is set greater than 1 then the input data frames are divided up into subbands of equal width and the grouping algorithm is applied independently within each band. The full band labelling data is reconstituted from the parallel subband labellings on output.
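Splitting a frame into equal-width subbands and reassembling it might look like the sketch below. It assumes the frame length is an exact multiple of the number of subbands; the source does not say how other cases are handled, and both function names are hypothetical.

```python
def split_into_subbands(frame, num_subbands):
    """Split one frame into num_subbands contiguous bands of equal width."""
    if num_subbands < 1 or len(frame) % num_subbands != 0:
        raise ValueError("frame length must be a multiple of num_subbands")
    width = len(frame) // num_subbands
    return [frame[i * width:(i + 1) * width] for i in range(num_subbands)]


def join_subbands(bands):
    """Reassemble a full-band frame from its subbands, in order."""
    return [value for band in bands for value in band]
```

Because each band is grouped independently, a region of 1's that spans a band boundary would be reported as separate groups on either side of it.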
The HAS_DELTAS switch has to be set so the block knows how to interpret the input data.
If the input missing data mask has deltas then HAS_DELTAS should be set to TRUE. In this case the grouping algorithm will be applied only to the non-delta features (i.e. the features in the lower half of the frame). Points in the delta mask (i.e. the upper half of the frame) will then be labelled with the group number of the corresponding non-delta spectro-temporal points.
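The frame layout with deltas can be sketched as below, assuming the frame holds the non-delta features followed by the same number of delta features. The helper names are hypothetical; the grouping step that turns the static half into labels is omitted.

```python
def static_half(frame):
    """Return the non-delta (lower) half of a frame that includes deltas."""
    return frame[:len(frame) // 2]


def expand_labels_to_deltas(static_labels):
    """Give each delta point the label of its corresponding non-delta point."""
    return list(static_labels) + list(static_labels)
```

For a 6-element mask frame, only the first 3 elements would be grouped, and the 3 resulting labels are then repeated for the delta half of the output frame.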
VOICING_THRESHOLD is used in conjunction with the optional voicing input. If the voicing input crosses the voicing threshold then the voicing state changes and groups will be split.
DELTA_PITCH_THRESHOLD is used in conjunction with the optional pitch input. If the pitch input changes from one frame to the next by more than the delta pitch threshold, then groups will be split.
| Inputs | Meaning | Sample | 1-D frame | |
|---|---|---|---|---|
| in1 | 1/0 missing data mask frames | No | Yes | No |
| (in2) | Degree of voicing | Yes | No | No |
| (in3) | Pitch estimate | Yes | No | No |
| Outputs | Meaning |
|---|---|
| out1 | labelled group frames |
| Parameters | Type | Default | Meaning |
|---|---|---|---|
| WINDOW_SIZE | Integer | 5 | Number of frames in running buffer |
| ONSET_GROUPING | Boolean | False | Perform common onset grouping ? |
| OFFSET_GROUPING | Boolean | False | Perform common offset grouping ? |
| MIN_GROUP_SIZE | Integer | 0 | (see above) |
| MAX_GROUP_SIZE | Integer | 0 | 0 = no max size (see above) |
| NUM_SUBBANDS | Integer | 1 | Number of subbands |
| HAS_DELTAS | Boolean | False | Should be set to TRUE if input data includes deltas |
| VOICING_THRESHOLD | Float | 0.5 | Threshold for discriminating voiced/unvoiced frames |
| DELTA_PITCH_THRESHOLD | Float | 10 | Max pitch change allowed before groups will be split |