CKIP Transformers¶

This project provides Traditional Chinese transformer models (including ALBERT, BERT, and GPT2) and NLP tools (including word segmentation, part-of-speech tagging, and named entity recognition).

Git¶

https://github.com/ckiplab/ckip-transformers

PyPI¶

https://pypi.org/project/ckip-transformers

Documentation¶

https://ckip-transformers.readthedocs.io

Demo¶

https://ckip.iis.sinica.edu.tw/service/transformers

Contributors¶

Mu Yang at CKIP (Author & Maintainer).
Wei-Yun Ma at CKIP (Maintainer).

Models¶

You may also use our pretrained models directly with the HuggingFace transformers library; they are available for download at https://huggingface.co/ckiplab/.

Language Models
- ALBERT Tiny: ckiplab/albert-tiny-chinese
- ALBERT Base: ckiplab/albert-base-chinese
- BERT Base: ckiplab/bert-base-chinese
- GPT2 Base: ckiplab/gpt2-base-chinese
NLP Task Models
- ALBERT Tiny — Word Segmentation: ckiplab/albert-tiny-chinese-ws
- ALBERT Tiny — Part-of-Speech Tagging: ckiplab/albert-tiny-chinese-pos
- ALBERT Tiny — Named-Entity Recognition: ckiplab/albert-tiny-chinese-ner
- ALBERT Base — Word Segmentation: ckiplab/albert-base-chinese-ws
- ALBERT Base — Part-of-Speech Tagging: ckiplab/albert-base-chinese-pos
- ALBERT Base — Named-Entity Recognition: ckiplab/albert-base-chinese-ner
- BERT Base — Word Segmentation: ckiplab/bert-base-chinese-ws
- BERT Base — Part-of-Speech Tagging: ckiplab/bert-base-chinese-pos
- BERT Base — Named-Entity Recognition: ckiplab/bert-base-chinese-ner

Model Usage¶

You may use our models directly through HuggingFace’s transformers library.

pip install -U transformers

Please use BertTokenizerFast as the tokenizer, and replace ckiplab/albert-tiny-chinese and ckiplab/albert-tiny-chinese-ws in the following example with any model you need.

from transformers import (
   BertTokenizerFast,
   AutoModelForMaskedLM,
   AutoModelForTokenClassification,
)

# language model
tokenizer = BertTokenizerFast.from_pretrained('bert-base-chinese')
model = AutoModelForMaskedLM.from_pretrained('ckiplab/albert-tiny-chinese') # or other models above

# nlp task model
tokenizer = BertTokenizerFast.from_pretrained('bert-base-chinese')
model = AutoModelForTokenClassification.from_pretrained('ckiplab/albert-tiny-chinese-ws') # or other models above

Model Fine-Tuning¶

To fine-tune our models on your own datasets, please refer to the following examples from HuggingFace’s transformers.

Remember to set --tokenizer_name bert-base-chinese in order to use the Chinese tokenizer.

# Replace the model name with any of the language models above
python run_mlm.py \
   --model_name_or_path ckiplab/albert-tiny-chinese \
   --tokenizer_name bert-base-chinese \
   ...

# Replace the model name with any of the NLP task models above
python run_ner.py \
   --model_name_or_path ckiplab/albert-tiny-chinese-ws \
   --tokenizer_name bert-base-chinese \
   ...

Performance¶

The following is a performance comparison between our models and other models.
The results are measured on a Traditional Chinese test corpus.

Model	Perplexity†	WS (F1)‡	POS (ACC)‡	NER (F1)‡
ckiplab/albert-tiny-chinese	4.80	96.66%	94.48%	71.17%
ckiplab/albert-base-chinese	2.65	97.33%	95.30%	79.47%
ckiplab/bert-base-chinese	1.88	97.60%	95.67%	81.18%
ckiplab/gpt2-base-chinese	14.40	–	–	–

voidful/albert_chinese_tiny	74.93	–	–	–
voidful/albert_chinese_base	22.34	–	–	–
bert-base-chinese	2.53	–	–	–

† Perplexity; the smaller the better.
‡ WS: word segmentation; POS: part-of-speech tagging; NER: named-entity recognition; the larger the better.
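
As a rough illustration of the metrics in the table, the sketch below uses the standard textbook definitions: perplexity as the exponential of the mean negative log-likelihood, and F1 as the harmonic mean of precision and recall. The source does not state exactly how the reported numbers were computed, and the function names here are hypothetical.

```python
import math


def perplexity(token_probs):
    """Perplexity = exp(mean negative log-likelihood) of the correct tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))


def f1_score(num_correct, num_predicted, num_gold):
    """F1 = harmonic mean of precision and recall (hypothetical helper)."""
    if num_correct == 0:
        return 0.0
    precision = num_correct / num_predicted
    recall = num_correct / num_gold
    return 2 * precision * recall / (precision + recall)


# A model that assigns probability 0.5 to every correct token
# has a perplexity of about 2 (lower is better).
print(perplexity([0.5, 0.5, 0.5, 0.5]))

# 8 correct items out of 10 predicted and 16 gold items (higher is better).
print(f1_score(num_correct=8, num_predicted=10, num_gold=16))
```

Lower perplexity means the model assigns higher probability to the held-out text, which is why the two kinds of columns in the table point in opposite directions.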

NLP Tools¶

The package also provides the following NLP tools.

(WS) Word Segmentation
(POS) Part-of-Speech Tagging
(NER) Named Entity Recognition

Installation¶

pip install -U ckip-transformers

Requirements:

NLP Tools Usage¶

See the documentation at https://ckip-transformers.readthedocs.io for API details.

The complete script of this example is available at https://github.com/ckiplab/ckip-transformers/blob/master/example/example.py.

1. Import module¶

from ckip_transformers.nlp import CkipWordSegmenter, CkipPosTagger, CkipNerChunker

2. Load models¶

We provide three levels (1–3) of drivers. Level 1 is the fastest, and level 3 (default) is the most accurate.

# Initialize drivers
ws_driver  = CkipWordSegmenter(level=3)
pos_driver = CkipPosTagger(level=3)
ner_driver = CkipNerChunker(level=3)

To use a GPU, specify the device ID when initializing the drivers. Set it to -1 (default) to disable the GPU.

# Use CPU
ws_driver = CkipWordSegmenter(device=-1)

# Use GPU:0
ws_driver = CkipWordSegmenter(device=0)

3. Run pipeline¶

The input for word segmentation and named-entity recognition must be a list of sentences.
The input for part-of-speech tagging must be a list of lists of words (the output of word segmentation).
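
To make the two input shapes concrete, here is a small sketch that checks them before calling a driver. The helper names are hypothetical and are not part of ckip_transformers; the drivers themselves may validate input differently.

```python
def is_sentence_list(obj):
    """True if obj is a list of strings (input shape for WS and NER)."""
    return isinstance(obj, list) and all(isinstance(s, str) for s in obj)


def is_word_list_list(obj):
    """True if obj is a list of lists of strings (input shape for POS)."""
    return isinstance(obj, list) and all(
        isinstance(sentence, list) and all(isinstance(w, str) for w in sentence)
        for sentence in obj
    )


sentences = ['空白 也是可以的～']                            # list of sentences
words = [['空白', ' ', '也', '是', '可以', '的', '～']]    # list of lists of words

print(is_sentence_list(sentences), is_word_list_list(words))
# A single string is not a valid input; wrap it in a list first.
print(is_sentence_list('空白 也是可以的～'))
```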

# Input text
text = [
   '傅達仁今將執行安樂死，卻突然爆出自己20年前遭緯來體育台封殺，他不懂自己哪裡得罪到電視台。',
   '美國參議院針對今天總統布什所提名的勞工部長趙小蘭展開認可聽證會，預料她將會很順利通過參議院支持，成為該國有史以來第一位的華裔女性內閣成員。',
   '空白 也是可以的～',
]

# Run pipeline
ws  = ws_driver(text)
pos = pos_driver(ws)
ner = ner_driver(text)

While running the model, the POS driver automatically splits each sentence internally at the characters '，,。：:；;！!？?'. (The output sentences are concatenated back.) You may set delim_set to any characters you want.
You may set use_delim=False to disable this feature, or set use_delim=True in the WS and NER drivers to enable it.
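
The idea of splitting at delimiters and concatenating the pieces back can be sketched in plain Python as follows. This is only an illustration of the concept under the assumption that each chunk ends right after a delimiter; it is not the library's actual implementation, and split_by_delims is a hypothetical name.

```python
DEFAULT_DELIMS = '，,。：:；;！!？?'


def split_by_delims(words, delim_set=DEFAULT_DELIMS):
    """Split a list of words into chunks, ending a chunk after each delimiter."""
    delims = set(delim_set)
    chunks, current = [], []
    for word in words:
        current.append(word)
        if word in delims:
            chunks.append(current)
            current = []
    if current:  # trailing words without a closing delimiter
        chunks.append(current)
    return chunks


words = ['傅達仁', '今', '將', '執行', '安樂死', '，', '卻', '突然', '爆出']
chunks = split_by_delims(words)

# Each chunk could be processed by the model separately; flattening the
# chunks afterwards restores the original word order.
restored = [word for chunk in chunks for word in chunk]
print(chunks)
print(restored == words)
```

Shorter chunks keep each model input small, which is presumably why the feature exists; passing a different delim_set changes where the chunks end.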

# Enable sentence segmentation
ws  = ws_driver(text, use_delim=True)
ner = ner_driver(text, use_delim=True)

# Disable sentence segmentation
pos = pos_driver(ws, use_delim=False)

# Use new line characters and tabs for sentence segmentation
pos = pos_driver(ws, delim_set='\n\t')

You may specify batch_size and max_length to better utilize your machine resources.

# Sets the batch size and maximum sentence length
ws = ws_driver(text, batch_size=256, max_length=512)
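
As a generic illustration of what a batch size controls, the sketch below groups a list of inputs into batches of at most batch_size items. It is a plain-Python sketch with a hypothetical helper name; the source does not describe how the drivers batch their inputs internally.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of items, each with at most batch_size elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


inputs = ['sentence 1', 'sentence 2', 'sentence 3', 'sentence 4', 'sentence 5']
for batch in iter_batches(inputs, batch_size=2):
    print(len(batch), batch)
```

A larger batch size generally uses more memory per step in exchange for fewer steps, so the best value depends on the machine.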

4. Show results¶

# Pack word segmentation and part-of-speech results into one string
def pack_ws_pos_sentence(sentence_ws, sentence_pos):
   assert len(sentence_ws) == len(sentence_pos)
   res = []
   for word_ws, word_pos in zip(sentence_ws, sentence_pos):
      res.append(f'{word_ws}({word_pos})')
   # Join the word(POS) pairs with full-width spaces
   return '\u3000'.join(res)

# Show results
for sentence, sentence_ws, sentence_pos, sentence_ner in zip(text, ws, pos, ner):
   print(sentence)
   print(pack_ws_pos_sentence(sentence_ws, sentence_pos))
   for entity in sentence_ner:
      print(entity)
   print()

傅達仁今將執行安樂死，卻突然爆出自己20年前遭緯來體育台封殺，他不懂自己哪裡得罪到電視台。
傅達仁(Nb)　今(Nd)　將(D)　執行(VC)　安樂死(Na)　，(COMMACATEGORY)　卻(D)　突然(D)　爆出(VJ)　自己(Nh)　20(Neu)　年(Nd)　前(Ng)　遭(P)　緯來(Nb)　體育台(Na)　封殺(VC)　，(COMMACATEGORY)　他(Nh)　不(D)　懂(VK)　自己(Nh)　哪裡(Ncd)　得罪到(VC)　電視台(Nc)　。(PERIODCATEGORY)
NerToken(word='傅達仁', ner='PERSON', idx=(0, 3))
NerToken(word='20年', ner='DATE', idx=(18, 21))
NerToken(word='緯來體育台', ner='ORG', idx=(23, 28))

美國參議院針對今天總統布什所提名的勞工部長趙小蘭展開認可聽證會，預料她將會很順利通過參議院支持，成為該國有史以來第一位的華裔女性內閣成員。
美國(Nc)　參議院(Nc)　針對(P)　今天(Nd)　總統(Na)　布什(Nb)　所(D)　提名(VC)　的(DE)　勞工部長(Na)　趙小蘭(Nb)　展開(VC)　認可(VC)　聽證會(Na)　，(COMMACATEGORY)　預料(VE)　她(Nh)　將(D)　會(D)　很(Dfa)　順利(VH)　通過(VC)　參議院(Nc)　支持(VC)　，(COMMACATEGORY)　成為(VG)　該(Nes)　國(Nc)　有史以來(D)　第一(Neu)　位(Nf)　的(DE)　華裔(Na)　女性(Na)　內閣(Na)　成員(Na)　。(PERIODCATEGORY)
NerToken(word='美國參議院', ner='ORG', idx=(0, 5))
NerToken(word='今天', ner='LOC', idx=(7, 9))
NerToken(word='布什', ner='PERSON', idx=(11, 13))
NerToken(word='勞工部長', ner='ORG', idx=(17, 21))
NerToken(word='趙小蘭', ner='PERSON', idx=(21, 24))
NerToken(word='認可聽證會', ner='EVENT', idx=(26, 31))
NerToken(word='參議院', ner='ORG', idx=(42, 45))
NerToken(word='第一', ner='ORDINAL', idx=(56, 58))
NerToken(word='華裔', ner='NORP', idx=(60, 62))

空白 也是可以的～
空白(VH)　 (WHITESPACE)　也(D)　是(SHI)　可以(VH)　的(T)　～(FW)
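
In the output above, each idx pair appears to be a half-open character span (start, end) into the input sentence. The sketch below checks this against the first sentence and filters entities by type. It defines a local stand-in NerToken with the same three fields as the printed output, so that it runs without the package installed; that stand-in is an assumption, not the library's own class.

```python
from typing import NamedTuple, Tuple


class NerToken(NamedTuple):
    """Local stand-in mirroring the fields printed above (word, ner, idx)."""
    word: str
    ner: str
    idx: Tuple[int, int]


sentence = '傅達仁今將執行安樂死，卻突然爆出自己20年前遭緯來體育台封殺，他不懂自己哪裡得罪到電視台。'
entities = [
    NerToken(word='傅達仁', ner='PERSON', idx=(0, 3)),
    NerToken(word='20年', ner='DATE', idx=(18, 21)),
    NerToken(word='緯來體育台', ner='ORG', idx=(23, 28)),
]

# Slicing the sentence with idx recovers the entity text.
spans_match = all(sentence[e.idx[0]:e.idx[1]] == e.word for e in entities)
print(spans_match)

# Collect only the entities of one type.
people = [e.word for e in entities if e.ner == 'PERSON']
print(people)
```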

Performance¶

The following is a performance comparison between our tool and other tools.