CKIP Transformers¶
Documentation¶
Contributors¶
Wei-Yun Ma at CKIP (Maintainer).
Models¶
- Language Models
ALBERT Tiny:
ckiplab/albert-tiny-chinese
ALBERT Base:
ckiplab/albert-base-chinese
BERT Base:
ckiplab/bert-base-chinese
GPT2 Base:
ckiplab/gpt2-base-chinese
- NLP Task Models
ALBERT Tiny — Word Segmentation:
ckiplab/albert-tiny-chinese-ws
ALBERT Tiny — Part-of-Speech Tagging:
ckiplab/albert-tiny-chinese-pos
ALBERT Tiny — Named-Entity Recognition:
ckiplab/albert-tiny-chinese-ner
ALBERT Base — Word Segmentation:
ckiplab/albert-base-chinese-ws
ALBERT Base — Part-of-Speech Tagging:
ckiplab/albert-base-chinese-pos
ALBERT Base — Named-Entity Recognition:
ckiplab/albert-base-chinese-ner
BERT Base — Word Segmentation:
ckiplab/bert-base-chinese-ws
BERT Base — Part-of-Speech Tagging:
ckiplab/bert-base-chinese-pos
BERT Base — Named-Entity Recognition:
ckiplab/bert-base-chinese-ner
Model Usage¶
pip install -U transformers
Replace ckiplab/albert-tiny-chinese and ckiplab/albert-tiny-chinese-ws with any model you need in the following example.
from transformers import (
    BertTokenizerFast,
    AutoModelForMaskedLM,
    AutoModelForCausalLM,
    AutoModelForTokenClassification,
)
# masked language model (ALBERT, BERT)
tokenizer = BertTokenizerFast.from_pretrained('bert-base-chinese')
model = AutoModelForMaskedLM.from_pretrained('ckiplab/albert-tiny-chinese') # or other models above
# causal language model (GPT2)
tokenizer = BertTokenizerFast.from_pretrained('bert-base-chinese')
model = AutoModelForCausalLM.from_pretrained('ckiplab/gpt2-base-chinese') # or other models above
# nlp task model
tokenizer = BertTokenizerFast.from_pretrained('bert-base-chinese')
model = AutoModelForTokenClassification.from_pretrained('ckiplab/albert-tiny-chinese-ws') # or other models above
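To make the output of a token-classification model more concrete, the sketch below turns per-character labels into words. It assumes, purely for illustration, that a word-segmentation model labels each character as B (begins a word) or I (continues a word). The helper name labels_to_words is hypothetical and not part of any library, and the labels are hand-written rather than real model output.

```python
from typing import List


def labels_to_words(chars: str, labels: List[str]) -> List[str]:
    """Group characters into words using per-character B/I labels.

    Assumed label scheme (illustrative only): 'B' starts a new word,
    'I' continues the current word.
    """
    if len(chars) != len(labels):
        raise ValueError("chars and labels must have the same length")
    words: List[str] = []
    for char, label in zip(chars, labels):
        if label == "B" or not words:
            words.append(char)   # start a new word
        else:
            words[-1] += char    # extend the current word
    return words


# Hand-written labels for illustration; a real model would predict them.
print(labels_to_words("他不懂自己", ["B", "B", "B", "B", "I"]))
```

In practice the NLP drivers described later on this page handle this decoding step for you.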
Model Fine-Tuning¶
https://github.com/huggingface/transformers/tree/master/examples
https://github.com/huggingface/transformers/tree/master/examples/pytorch/language-modeling
https://github.com/huggingface/transformers/tree/master/examples/pytorch/token-classification
Pass --tokenizer_name bert-base-chinese in order to use the Chinese tokenizer.
# or other models above
python run_mlm.py \
  --model_name_or_path ckiplab/albert-tiny-chinese \
  --tokenizer_name bert-base-chinese \
  ...
# or other models above
python run_ner.py \
  --model_name_or_path ckiplab/albert-tiny-chinese-ws \
  --tokenizer_name bert-base-chinese \
  ...
Model Performance¶
Model | #Parameters | Perplexity† | WS (F1)‡ | POS (ACC)‡ | NER (F1)‡ |
---|---|---|---|---|---|
ckiplab/albert-tiny-chinese | 4M | 4.80 | 96.66% | 94.48% | 71.17% |
ckiplab/albert-base-chinese | 10M | 2.65 | 97.33% | 95.30% | 79.47% |
ckiplab/bert-base-chinese | 102M | 1.88 | 97.60% | 95.67% | 81.18% |
ckiplab/gpt2-base-chinese | 102M | 14.40 | – | – | – |
voidful/albert_chinese_tiny | 4M | 74.93 | – | – | – |
voidful/albert_chinese_base | 10M | 22.34 | – | – | – |
bert-base-chinese | 102M | 2.53 | – | – | – |
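The perplexity column can be read with the standard definition in mind: the exponential of the average negative log-likelihood per token, where lower is better. The sketch below is a minimal illustration of that formula using made-up probabilities. It is not the evaluation script behind the numbers above, and perplexity is a hypothetical helper name.

```python
import math
from typing import Sequence


def perplexity(token_probs: Sequence[float]) -> float:
    """Return exp of the mean negative log-likelihood of the tokens."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)


# Made-up probabilities a model might assign to the correct tokens.
# A model that gives every correct token probability 1/2 is, on average,
# as uncertain as a fair coin flip per token.
print(perplexity([0.5, 0.5, 0.5, 0.5]))
```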
Training Corpus¶
- CNA: https://catalog.ldc.upenn.edu/LDC2011T13
  - Chinese Gigaword Fifth Edition, CNA (Central News Agency) part.
- ASBC: http://asbc.iis.sinica.edu.tw
  - Academia Sinica Balanced Corpus of Modern Chinese release 4.0.
- OntoNotes: https://catalog.ldc.upenn.edu/LDC2013T19
Dataset | #Documents | #Lines | #Characters | Line Type |
---|---|---|---|---|
CNA | 2,559,520 | 13,532,445 | 1,219,029,974 | Paragraph |
ZhWiki | 1,106,783 | 5,918,975 | 495,446,829 | Paragraph |
ASBC | 19,247 | 1,395,949 | 17,572,374 | Clause |
OntoNotes | 1,911 | 48,067 | 1,568,491 | Sentence |
CNA+ZhWiki | #Documents | #Lines | #Characters |
---|---|---|---|
Train | 3,606,303 | 18,986,238 | 4,347,517,682 |
Dev | 30,000 | 148,077 | 32,888,978 |
Test | 30,000 | 151,241 | 35,216,818 |
ASBC | #Documents | #Lines | #Words | #Characters |
---|---|---|---|---|
Train | 15,247 | 1,183,260 | 9,480,899 | 14,724,250 |
Dev | 2,000 | 52,677 | 448,964 | 741,323 |
Test | 2,000 | 160,012 | 1,315,129 | 2,106,799 |
OntoNotes | #Documents | #Lines | #Characters | #Named-Entities |
---|---|---|---|---|
Train | 1,511 | 43,362 | 1,367,658 | 68,947 |
Dev | 200 | 2,304 | 93,535 | 7,186 |
Test | 200 | 2,401 | 107,298 | 6,977 |
NLP Tools¶
(WS) Word Segmentation
(POS) Part-of-Speech Tagging
(NER) Named Entity Recognition
NLP Tools Usage¶
1. Import module¶
from ckip_transformers.nlp import CkipWordSegmenter, CkipPosTagger, CkipNerChunker
2. Load models¶
# Initialize drivers
ws_driver = CkipWordSegmenter(level=3)
pos_driver = CkipPosTagger(level=3)
ner_driver = CkipNerChunker(level=3)
# Initialize drivers with custom checkpoints
ws_driver = CkipWordSegmenter(model_name='path_to_your_model')
pos_driver = CkipPosTagger(model_name='path_to_your_model')
ner_driver = CkipNerChunker(model_name='path_to_your_model')
# Use CPU
ws_driver = CkipWordSegmenter(device=-1)
# Use GPU:0
ws_driver = CkipWordSegmenter(device=0)
3. Run pipeline¶
# Input text
text = [
'傅達仁今將執行安樂死,卻突然爆出自己20年前遭緯來體育台封殺,他不懂自己哪裡得罪到電視台。',
'美國參議院針對今天總統布什所提名的勞工部長趙小蘭展開認可聽證會,預料她將會很順利通過參議院支持,成為該國有史以來第一位的華裔女性內閣成員。',
'空白 也是可以的~',
]
# Run pipeline
ws = ws_driver(text)
pos = pos_driver(ws)
ner = ner_driver(text)
The POS driver automatically segments sentences internally using the characters ',,。::;;!!??' while running the model. (The output sentences will be concatenated back.) You may set delim_set to any characters you want.
You may set use_delim=False to disable this feature, or set use_delim=True in the WS and NER drivers to enable it.
# Enable sentence segmentation
ws = ws_driver(text, use_delim=True)
ner = ner_driver(text, use_delim=True)
# Disable sentence segmentation
pos = pos_driver(ws, use_delim=False)
# Use new line characters and tabs for sentence segmentation
pos = pos_driver(ws, delim_set='\n\t')
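As a rough mental model of what delimiter-based segmentation does, the sketch below splits a sentence right after each delimiter character and shows that joining the pieces restores the original text. It is an illustration only, not the library's internal implementation. split_by_delims is a hypothetical helper, and the default delimiter string is copied from the documented default.

```python
from typing import List

DEFAULT_DELIMS = ',,。::;;!!??'  # copied from the documented default


def split_by_delims(sentence: str, delim_set: str = DEFAULT_DELIMS) -> List[str]:
    """Split a sentence into pieces, each ending right after a delimiter."""
    pieces: List[str] = []
    start = 0
    for i, char in enumerate(sentence):
        if char in delim_set:
            pieces.append(sentence[start:i + 1])  # keep the delimiter
            start = i + 1
    if start < len(sentence):
        pieces.append(sentence[start:])  # trailing text with no delimiter
    return pieces


sentence = '傅達仁今將執行安樂死,卻突然爆出自己20年前遭緯來體育台封殺,他不懂自己哪裡得罪到電視台。'
pieces = split_by_delims(sentence)
print(pieces)
# Joining the pieces gives back the original sentence.
print(''.join(pieces) == sentence)
```

Shorter pieces keep each model input well within the maximum sequence length, which is the practical reason to enable this option on long inputs.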
You may also set batch_size and max_length to better utilize your machine resources.
# Sets the batch size and maximum sentence length
ws = ws_driver(text, batch_size=256, max_length=512)
4. Show results¶
# Pack word segmentation and part-of-speech results
def pack_ws_pos_sentence(sentence_ws, sentence_pos):
    assert len(sentence_ws) == len(sentence_pos)
    res = []
    for word_ws, word_pos in zip(sentence_ws, sentence_pos):
        res.append(f'{word_ws}({word_pos})')
    return '\u3000'.join(res)
# Show results
for sentence, sentence_ws, sentence_pos, sentence_ner in zip(text, ws, pos, ner):
    print(sentence)
    print(pack_ws_pos_sentence(sentence_ws, sentence_pos))
    for entity in sentence_ner:
        print(entity)
    print()
傅達仁今將執行安樂死,卻突然爆出自己20年前遭緯來體育台封殺,他不懂自己哪裡得罪到電視台。
傅達仁(Nb) 今(Nd) 將(D) 執行(VC) 安樂死(Na) ,(COMMACATEGORY) 卻(D) 突然(D) 爆出(VJ) 自己(Nh) 20(Neu) 年(Nd) 前(Ng) 遭(P) 緯來(Nb) 體育台(Na) 封殺(VC) ,(COMMACATEGORY) 他(Nh) 不(D) 懂(VK) 自己(Nh) 哪裡(Ncd) 得罪到(VC) 電視台(Nc) 。(PERIODCATEGORY)
NerToken(word='傅達仁', ner='PERSON', idx=(0, 3))
NerToken(word='20年', ner='DATE', idx=(18, 21))
NerToken(word='緯來體育台', ner='ORG', idx=(23, 28))
美國參議院針對今天總統布什所提名的勞工部長趙小蘭展開認可聽證會,預料她將會很順利通過參議院支持,成為該國有史以來第一位的華裔女性內閣成員。
美國(Nc) 參議院(Nc) 針對(P) 今天(Nd) 總統(Na) 布什(Nb) 所(D) 提名(VC) 的(DE) 勞工部長(Na) 趙小蘭(Nb) 展開(VC) 認可(VC) 聽證會(Na) ,(COMMACATEGORY) 預料(VE) 她(Nh) 將(D) 會(D) 很(Dfa) 順利(VH) 通過(VC) 參議院(Nc) 支持(VC) ,(COMMACATEGORY) 成為(VG) 該(Nes) 國(Nc) 有史以來(D) 第一(Neu) 位(Nf) 的(DE) 華裔(Na) 女性(Na) 內閣(Na) 成員(Na) 。(PERIODCATEGORY)
NerToken(word='美國參議院', ner='ORG', idx=(0, 5))
NerToken(word='今天', ner='LOC', idx=(7, 9))
NerToken(word='布什', ner='PERSON', idx=(11, 13))
NerToken(word='勞工部長', ner='ORG', idx=(17, 21))
NerToken(word='趙小蘭', ner='PERSON', idx=(21, 24))
NerToken(word='認可聽證會', ner='EVENT', idx=(26, 31))
NerToken(word='參議院', ner='ORG', idx=(42, 45))
NerToken(word='第一', ner='ORDINAL', idx=(56, 58))
NerToken(word='華裔', ner='NORP', idx=(60, 62))
空白 也是可以的~
空白(VH) (WHITESPACE) 也(D) 是(SHI) 可以(VH) 的(T) ~(FW)
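The idx field of each NerToken is a (start, end) character span into the input sentence, so slicing the sentence with it recovers the entity text. The sketch below checks this against the first sentence's output shown above. To stay self-contained, it defines a stand-in NerToken with the documented fields instead of importing the package.

```python
from typing import NamedTuple, Tuple


class NerToken(NamedTuple):
    """Stand-in mirroring the documented fields (word, ner, idx)."""
    word: str
    ner: str
    idx: Tuple[int, int]


sentence = '傅達仁今將執行安樂死,卻突然爆出自己20年前遭緯來體育台封殺,他不懂自己哪裡得罪到電視台。'
entities = [
    NerToken(word='傅達仁', ner='PERSON', idx=(0, 3)),
    NerToken(word='20年', ner='DATE', idx=(18, 21)),
    NerToken(word='緯來體育台', ner='ORG', idx=(23, 28)),
]

for entity in entities:
    start, end = entity.idx
    # The span is half-open: start is inclusive, end is exclusive.
    assert sentence[start:end] == entity.word
    print(entity.ner, sentence[start:end])
```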
NLP Tools Performance¶
CKIP Transformers vs. Monpa & Jieba¶
Level | Tool | WS (F1) | POS (Acc) | WS+POS (F1) | NER (F1) |
---|---|---|---|---|---|
3 | CKIP BERT Base | 97.60% | 95.67% | 94.19% | 81.18% |
2 | CKIP ALBERT Base | 97.33% | 95.30% | 93.52% | 79.47% |
1 | CKIP ALBERT Tiny | 96.66% | 94.48% | 92.25% | 71.17% |
– | Monpa† | 92.58% | – | 83.88% | – |
– | Jieba | 81.18% | – | – | – |
CKIP Transformers vs. CkipTagger¶
Level | Tool | WS (F1) | POS (Acc) | WS+POS (F1) | NER (F1) |
---|---|---|---|---|---|
3 | CKIP BERT Base | 97.84% | 96.46% | 94.91% | 79.20% |
– | CkipTagger | 97.33% | 97.20% | 94.75% | 77.87% |
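For readers unfamiliar with how a word-segmentation F1 score is typically obtained, the sketch below shows one common formulation: each segmentation is converted to character spans, and precision and recall are computed over spans that match exactly. This is an assumption about the general metric, not the evaluation code used for the tables above. to_spans and ws_f1 are hypothetical helper names, and the predicted segmentation is invented for the example.

```python
from typing import List, Set, Tuple


def to_spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Convert a list of words into (start, end) character spans."""
    spans: Set[Tuple[int, int]] = set()
    start = 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def ws_f1(gold: List[str], pred: List[str]) -> float:
    """Span-level F1 between a gold and a predicted segmentation."""
    gold_spans, pred_spans = to_spans(gold), to_spans(pred)
    correct = len(gold_spans & pred_spans)
    if correct == 0:
        return 0.0
    precision = correct / len(pred_spans)
    recall = correct / len(gold_spans)
    return 2 * precision * recall / (precision + recall)


gold = ['他', '不', '懂', '自己']
pred = ['他', '不懂', '自己']  # hypothetical prediction that merges two words
print(ws_f1(gold, pred))
```

Here two of the three predicted spans and two of the four gold spans match, giving a precision of 2/3, a recall of 1/2, and an F1 of 4/7.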
ckip_transformers package¶
The CKIP Transformers.
Subpackages
ckip_transformers.nlp package¶
This module provides the CKIP Transformers NLP drivers.
Submodules
ckip_transformers.nlp.driver module¶
This module implements the CKIP Transformers NLP drivers.
- class ckip_transformers.nlp.driver.CkipWordSegmenter(level: int = 3, **kwargs)[source]¶
Bases: ckip_transformers.nlp.util.CkipTokenClassification
The word segmentation driver.
- Parameters
  level (int, optional, defaults to 3, must be 1 to 3) – The model level. The higher the level, the more accurate and the slower the model.
  model_name (str, optional, overrides level) – The pretrained model name (e.g. 'ckiplab/bert-base-chinese-ws').
  device (int, optional, defaults to -1) – Device ordinal for CPU/GPU support. Setting this to -1 uses the CPU; a non-negative value runs the model on the associated CUDA device id.
- __call__(input_text: List[str], *, use_delim: bool = False, **kwargs) → List[List[str]][source]¶
Call the driver.
- Parameters
  input_text (List[str]) – The input sentences. Each sentence is a string.
  use_delim (bool, optional, defaults to False) – Segment sentences (internally) using delim_set.
  delim_set (str, optional, defaults to ',,。::;;!!??') – Used for sentence segmentation if use_delim=True.
  batch_size (int, optional, defaults to 256) – The size of the mini-batch.
  max_length (int, optional) – The maximum length of the sentence; must not be longer than the maximum sequence length for this model (i.e. tokenizer.model_max_length).
  show_progress (bool, optional, defaults to True) – Show progress bar.
- Returns
  List[List[str]] – A list of lists of words (str).
- class ckip_transformers.nlp.driver.CkipPosTagger(level: int = 3, **kwargs)[source]¶
Bases: ckip_transformers.nlp.util.CkipTokenClassification
The part-of-speech tagging driver.
- Parameters
  level (int, optional, defaults to 3, must be 1 to 3) – The model level. The higher the level, the more accurate and the slower the model.
  model_name (str, optional, overrides level) – The pretrained model name (e.g. 'ckiplab/bert-base-chinese-pos').
  device (int, optional, defaults to -1) – Device ordinal for CPU/GPU support. Setting this to -1 uses the CPU; a non-negative value runs the model on the associated CUDA device id.
- __call__(input_text: List[List[str]], *, use_delim: bool = True, **kwargs) → List[List[str]][source]¶
Call the driver.
- Parameters
  input_text (List[List[str]]) – The input sentences. Each sentence is a list of strings (words).
  use_delim (bool, optional, defaults to True) – Segment sentences (internally) using delim_set.
  delim_set (str, optional, defaults to ',,。::;;!!??') – Used for sentence segmentation if use_delim=True.
  batch_size (int, optional, defaults to 256) – The size of the mini-batch.
  max_length (int, optional) – The maximum length of the sentence; must not be longer than the maximum sequence length for this model (i.e. tokenizer.model_max_length).
  show_progress (bool, optional, defaults to True) – Show progress bar.
- Returns
  List[List[str]] – A list of lists of POS tags (str).
- class ckip_transformers.nlp.driver.CkipNerChunker(level: int = 3, **kwargs)[source]¶
Bases: ckip_transformers.nlp.util.CkipTokenClassification
The named-entity recognition driver.
- Parameters
  level (int, optional, defaults to 3, must be 1 to 3) – The model level. The higher the level, the more accurate and the slower the model.
  model_name (str, optional, overrides level) – The pretrained model name (e.g. 'ckiplab/bert-base-chinese-ner').
  device (int, optional, defaults to -1) – Device ordinal for CPU/GPU support. Setting this to -1 uses the CPU; a non-negative value runs the model on the associated CUDA device id.
- __call__(input_text: List[str], *, use_delim: bool = False, **kwargs) → List[List[ckip_transformers.nlp.util.NerToken]][source]¶
Call the driver.
- Parameters
  input_text (List[str]) – The input sentences. Each sentence is a string or a list of strings (words).
  use_delim (bool, optional, defaults to False) – Segment sentences (internally) using delim_set.
  delim_set (str, optional, defaults to ',,。::;;!!??') – Used for sentence segmentation if use_delim=True.
  batch_size (int, optional, defaults to 256) – The size of the mini-batch.
  max_length (int, optional) – The maximum length of the sentence; must not be longer than the maximum sequence length for this model (i.e. tokenizer.model_max_length).
  show_progress (bool, optional, defaults to True) – Show progress bar.
- Returns
  List[List[NerToken]] – A list of lists of entities (NerToken).
ckip_transformers.nlp.util module¶
This module implements the utilities for CKIP Transformers NLP drivers.
- class ckip_transformers.nlp.util.CkipTokenClassification(model_name: str, tokenizer_name: Optional[str] = None, *, device: int = -1)[source]¶
Bases: object
The base class for token classification tasks.
- Parameters
  model_name (str) – The pretrained model name (e.g. 'ckiplab/bert-base-chinese-ws').
  tokenizer_name (str, optional, defaults to model_name) – The pretrained tokenizer name (e.g. 'bert-base-chinese').
  device (int, optional, defaults to -1) – Device ordinal for CPU/GPU support. Setting this to -1 uses the CPU; a non-negative value runs the model on the associated CUDA device id.
- __call__(input_text: Union[List[str], List[List[str]]], *, use_delim: bool = False, delim_set: Optional[str] = ',,。::;;!!??', batch_size: int = 256, max_length: Optional[int] = None, show_progress: bool = True)[source]¶
Call the driver.
- Parameters
  input_text (List[str] or List[List[str]]) – The input sentences. Each sentence is a string or a list of strings.
  use_delim (bool, optional, defaults to False) – Segment sentences (internally) using delim_set.
  delim_set (str, optional, defaults to ',,。::;;!!??') – Used for sentence segmentation if use_delim=True.
  batch_size (int, optional, defaults to 256) – The size of the mini-batch.
  max_length (int, optional) – The maximum length of the sentence; must not be longer than the maximum sequence length for this model (i.e. tokenizer.model_max_length).
  show_progress (bool, optional, defaults to True) – Show progress bar.
- class ckip_transformers.nlp.util.NerToken(word: str, ner: str, idx: Tuple[int, int])[source]¶
Bases: tuple
A named-entity recognition token.
- property word¶
  str, the token word.
- property ner¶
  str, the NER tag.
- property idx¶
  Tuple[int, int], the starting / ending index in the sentence.
- __getnewargs__()¶
Return self as a plain tuple. Used by copy and pickle.
- static __new__(_cls, word: str, ner: str, idx: Tuple[int, int])¶
Create new instance of NerToken(word, ner, idx)
- __repr__()¶
Return a nicely formatted representation string