ckip_transformers.nlp.driver module

This module implements the CKIP Transformers NLP drivers.

class ckip_transformers.nlp.driver.CkipWordSegmenter(level: int = 3, **kwargs)[source]

The word segmentation driver.

Parameters

level (int, optional, defaults to 3, must be 1 to 3) – The model level. Higher levels are more accurate but slower.
model_name (str, optional, overrides level) – The pretrained model name (e.g. 'ckiplab/bert-base-chinese-ws').
device (int, optional, defaults to -1) – Device ordinal for CPU/GPU support. Setting this to -1 runs the model on the CPU; a non-negative value runs it on the CUDA device with that ID.
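
The constructor parameters above can be exercised with a small sketch. The helper names `pick_device` and `build_segmenter` are hypothetical and not part of the library; the sketch assumes PyTorch is the backend used to detect CUDA and that constructing the driver may fetch pretrained weights.

```python
def pick_device() -> int:
    """Return 0 when PyTorch can see a CUDA GPU, otherwise -1 (CPU)."""
    try:
        import torch  # imported lazily so the helper also works without PyTorch
    except ImportError:
        return -1
    return 0 if torch.cuda.is_available() else -1


def build_segmenter(level: int = 3, device: int = -1):
    """Construct a word segmentation driver (hypothetical convenience wrapper)."""
    # Validate early so a bad level fails before any model is loaded.
    if level not in (1, 2, 3):
        raise ValueError("level must be 1, 2, or 3")
    # Imported here so that validation does not require the package.
    from ckip_transformers.nlp.driver import CkipWordSegmenter

    return CkipWordSegmenter(level=level, device=device)
```

A call such as `build_segmenter(level=1, device=pick_device())` would then trade some accuracy for speed and use the first GPU when one is available.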

__call__(input_text: List[str], *, use_delim: bool = False, **kwargs) → List[List[str]][source]

Call the driver.

Parameters

input_text (List[str]) – The input sentences. Each sentence is a string.
use_delim (bool, optional, defaults to False) – Segment sentence (internally) using delim_set.
delim_set (str, optional, defaults to '，,。：:；;！!？?') – Used for sentence segmentation if use_delim=True.
batch_size (int, optional, defaults to 256) – The size of mini-batch.
max_length (int, optional) – The maximum sentence length; must not exceed the maximum sequence length for this model (i.e. tokenizer.model_max_length).
show_progress (bool, optional, defaults to True) – Show a progress bar.
pin_memory (bool, optional, defaults to True) – Pin memory to speed up data transfer to the GPU. This option is incompatible with multiprocessing.

Returns

List[List[str]] – A list of lists of words (str), one inner list per input sentence.
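
To make the effect of use_delim and delim_set more concrete, here is a rough mental model in plain Python. It assumes the text is cut immediately after each delimiter character; the driver's internal logic is not shown in this reference and may differ, and `split_on_delims` is a hypothetical name.

```python
from typing import List

# Default delimiter characters as listed for delim_set above.
DEFAULT_DELIMS = "，,。：:；;！!？?"


def split_on_delims(sentence: str, delim_set: str = DEFAULT_DELIMS) -> List[str]:
    """Cut a sentence after every delimiter, keeping the delimiter in its chunk."""
    delims = set(delim_set)
    chunks: List[str] = []
    start = 0
    for i, ch in enumerate(sentence):
        if ch in delims:
            chunks.append(sentence[start:i + 1])
            start = i + 1
    # Keep any trailing text that does not end with a delimiter.
    if start < len(sentence):
        chunks.append(sentence[start:])
    return chunks
```

Because the pieces concatenate back to the original sentence, splitting this way shortens what the model sees at once without losing any characters.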

class ckip_transformers.nlp.driver.CkipPosTagger(level: int = 3, **kwargs)[source]

The part-of-speech tagging driver.

Parameters

level (int, optional, defaults to 3, must be 1 to 3) – The model level. Higher levels are more accurate but slower.
model_name (str, optional, overrides level) – The pretrained model name (e.g. 'ckiplab/bert-base-chinese-pos').
device (int, optional, defaults to -1) – Device ordinal for CPU/GPU support. Setting this to -1 runs the model on the CPU; a non-negative value runs it on the CUDA device with that ID.

__call__(input_text: List[List[str]], *, use_delim: bool = True, **kwargs) → List[List[str]][source]

Call the driver.

Parameters

input_text (List[List[str]]) – The input sentences. Each sentence is a list of strings (words).
use_delim (bool, optional, defaults to True) – Segment sentence (internally) using delim_set.
delim_set (str, optional, defaults to '，,。：:；;！!？?') – Used for sentence segmentation if use_delim=True.
batch_size (int, optional, defaults to 256) – The size of mini-batch.
max_length (int, optional) – The maximum sentence length; must not exceed the maximum sequence length for this model (i.e. tokenizer.model_max_length).
show_progress (bool, optional, defaults to True) – Show a progress bar.
pin_memory (bool, optional, defaults to True) – Pin memory to speed up data transfer to the GPU. This option is incompatible with multiprocessing.

Returns

List[List[str]] – A list of lists of POS tags (str), one inner list per input sentence.
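
Since the tagger consumes the segmenter's output, the two drivers are naturally chained. The sketch below assumes the tagger returns exactly one tag per input word; the helper names are hypothetical, and the drivers are passed in as arguments so the pairing logic can be tried without loading a model.

```python
from typing import Callable, List, Tuple

Sentences = List[str]
Tokens = List[List[str]]


def pair_words_with_tags(ws: Tokens, pos: Tokens) -> List[List[Tuple[str, str]]]:
    """Zip each word with its tag, sentence by sentence."""
    if len(ws) != len(pos):
        raise ValueError("ws and pos must cover the same number of sentences")
    paired = []
    for words, tags in zip(ws, pos):
        if len(words) != len(tags):
            raise ValueError("each sentence needs one tag per word")
        paired.append(list(zip(words, tags)))
    return paired


def segment_and_tag(
    sentences: Sentences,
    ws_driver: Callable[[Sentences], Tokens],
    pos_driver: Callable[[Tokens], Tokens],
) -> List[List[Tuple[str, str]]]:
    """Run word segmentation, feed the words to the tagger, and pair the results."""
    ws = ws_driver(sentences)
    pos = pos_driver(ws)
    return pair_words_with_tags(ws, pos)
```

In real use, `ws_driver` and `pos_driver` would be CkipWordSegmenter and CkipPosTagger instances, which are callable as documented above.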

class ckip_transformers.nlp.driver.CkipNerChunker(level: int = 3, **kwargs)[source]

The named-entity recognition driver.

Parameters

level (int, optional, defaults to 3, must be 1 to 3) – The model level. Higher levels are more accurate but slower.
model_name (str, optional, overrides level) – The pretrained model name (e.g. 'ckiplab/bert-base-chinese-ner').
device (int, optional, defaults to -1) – Device ordinal for CPU/GPU support. Setting this to -1 runs the model on the CPU; a non-negative value runs it on the CUDA device with that ID.

__call__(input_text: List[str], *, use_delim: bool = False, **kwargs) → List[List[ckip_transformers.nlp.util.NerToken]][source]

Call the driver.

Parameters

input_text (List[str]) – The input sentences. Each sentence is a string.
use_delim (bool, optional, defaults to False) – Segment sentence (internally) using delim_set.
delim_set (str, optional, defaults to '，,。：:；;！!？?') – Used for sentence segmentation if use_delim=True.
batch_size (int, optional, defaults to 256) – The size of mini-batch.
max_length (int, optional) – The maximum sentence length; must not exceed the maximum sequence length for this model (i.e. tokenizer.model_max_length).
show_progress (bool, optional, defaults to True) – Show a progress bar.
pin_memory (bool, optional, defaults to True) – Pin memory to speed up data transfer to the GPU. This option is incompatible with multiprocessing.

Returns

List[List[NerToken]] – A list of lists of entities (NerToken), one inner list per input sentence.
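
A common follow-up step is to group the returned entities by type. This reference does not list the fields of NerToken, so the sketch below uses a stand-in class and assumes each token exposes `word`, `ner` (the entity type), and `idx` (a start/end character span); check ckip_transformers.nlp.util.NerToken before relying on these names.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class NerTokenSketch(NamedTuple):
    """Stand-in for NerToken; the field names are an assumption."""

    word: str
    ner: str
    idx: Tuple[int, int]


def group_entities_by_type(entities: List[NerTokenSketch]) -> Dict[str, List[str]]:
    """Collect the entity words of one sentence under their entity type."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for token in entities:
        grouped[token.ner].append(token.word)
    return dict(grouped)
```

Applied to one inner list of the driver's return value, this yields a mapping from entity type to the words of that type, in order of appearance.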