CKIP Transformers

This project provides Traditional Chinese transformers models (including ALBERT, BERT, and GPT2) and NLP tools (including word segmentation, part-of-speech tagging, and named entity recognition).

Git

PyPI

Documentation

Demo

Contributors

Models

You may also use our pretrained models directly with the HuggingFace transformers library: https://huggingface.co/ckiplab/.

Model Usage

You may use our models directly from the HuggingFace transformers library.
pip install -U transformers
Please use BertTokenizerFast as the tokenizer, and replace ckiplab/albert-tiny-chinese and ckiplab/albert-tiny-chinese-ws in the following example with any model you need.
from transformers import (
   BertTokenizerFast,
   AutoModelForMaskedLM,
   AutoModelForCausalLM,
   AutoModelForTokenClassification,
)

# masked language model (ALBERT, BERT)
tokenizer = BertTokenizerFast.from_pretrained('bert-base-chinese')
model = AutoModelForMaskedLM.from_pretrained('ckiplab/albert-tiny-chinese') # or other models above

# causal language model (GPT2)
tokenizer = BertTokenizerFast.from_pretrained('bert-base-chinese')
model = AutoModelForCausalLM.from_pretrained('ckiplab/gpt2-base-chinese') # or other models above

# nlp task model
tokenizer = BertTokenizerFast.from_pretrained('bert-base-chinese')
model = AutoModelForTokenClassification.from_pretrained('ckiplab/albert-tiny-chinese-ws') # or other models above

Model Fine-Tuning

To fine-tune our models on your own datasets, please refer to the following examples from HuggingFace's transformers.
Remember to set --tokenizer_name bert-base-chinese in order to use the Chinese tokenizer.
# Replace the model name with any of the models above.
python run_mlm.py \
   --model_name_or_path ckiplab/albert-tiny-chinese \
   --tokenizer_name bert-base-chinese \
   ...

# Replace the model name with any of the models above.
python run_ner.py \
   --model_name_or_path ckiplab/albert-tiny-chinese-ws \
   --tokenizer_name bert-base-chinese \
   ...

Model Performance

The following is a performance comparison between our models and other models.
All tasks were evaluated on a Traditional Chinese test set.

| Model | #Parameters | Perplexity† | WS (F1)‡ | POS (ACC)‡ | NER (F1)‡ |
|---|---|---|---|---|---|
| ckiplab/albert-tiny-chinese | 4M | 4.80 | 96.66% | 94.48% | 71.17% |
| ckiplab/albert-base-chinese | 11M | 2.65 | 97.33% | 95.30% | 79.47% |
| ckiplab/bert-tiny-chinese | 12M | 8.07 | 96.98% | 95.11% | 74.21% |
| ckiplab/bert-base-chinese | 102M | 1.88 | 97.60% | 95.67% | 81.18% |
| ckiplab/gpt2-tiny-chinese | 4M | 16.94 | | | |
| ckiplab/gpt2-base-chinese | 102M | 8.36 | | | |
| voidful/albert_chinese_tiny | 4M | 74.93 | | | |
| voidful/albert_chinese_base | 11M | 22.34 | | | |
| bert-base-chinese | 102M | 2.53 | | | |

† Perplexity; the smaller the better.
‡ WS: word segmentation; POS: part-of-speech tagging; NER: named entity recognition; the larger the better.
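For readers unfamiliar with the metric, perplexity is conventionally defined as the exponential of the average negative log-likelihood per token. The sketch below illustrates that definition on made-up token probabilities. The function name `perplexity` and the input values are illustrative assumptions; this README does not spell out the exact evaluation procedure behind the numbers in the table.

```python
import math


def perplexity(token_probs):
    """Return exp of the mean negative log-likelihood of the given token probabilities."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)


# Hypothetical probabilities that a model assigned to the correct tokens.
uniform_guess = [0.25, 0.25, 0.25, 0.25]
confident = [0.9, 0.8, 0.95, 0.85]

print(perplexity(uniform_guess))  # about 4: like choosing among 4 equally likely options
print(perplexity(confident))      # lower, closer to 1: better predictions
```

A perplexity near 1 means the model almost always assigns high probability to the correct token, which is why smaller values are better.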

Training Corpus

The language models are trained on the ZhWiki and CNA datasets; the WS and POS task models are trained on the ASBC dataset; the NER task models are trained on the OntoNotes dataset.
Here is a summary of each corpus.

| Dataset | #Documents | #Lines | #Characters | Line Type |
|---|---|---|---|---|
| CNA | 2,559,520 | 13,532,445 | 1,219,029,974 | Paragraph |
| ZhWiki | 1,106,783 | 5,918,975 | 495,446,829 | Paragraph |
| ASBC | 19,247 | 1,395,949 | 17,572,374 | Clause |
| OntoNotes | 1,911 | 48,067 | 1,568,491 | Sentence |

Here is the dataset split used for the language models.

| CNA+ZhWiki | #Documents | #Lines | #Characters |
|---|---|---|---|
| Train | 3,606,303 | 18,986,238 | 4,347,517,682 |
| Dev | 30,000 | 148,077 | 32,888,978 |
| Test | 30,000 | 151,241 | 35,216,818 |

Here is the dataset split used for the word segmentation and part-of-speech tagging models.

| ASBC | #Documents | #Lines | #Words | #Characters |
|---|---|---|---|---|
| Train | 15,247 | 1,183,260 | 9,480,899 | 14,724,250 |
| Dev | 2,000 | 52,677 | 448,964 | 741,323 |
| Test | 2,000 | 160,012 | 1,315,129 | 2,106,799 |

Here is the dataset split used for the named entity recognition models.

| OntoNotes | #Documents | #Lines | #Characters | #Named-Entities |
|---|---|---|---|---|
| Train | 1,511 | 43,362 | 1,367,658 | 68,947 |
| Dev | 200 | 2,304 | 93,535 | 7,186 |
| Test | 200 | 2,401 | 107,298 | 6,977 |

NLP Tools

The package also provides the following NLP tools.
  • (WS) Word Segmentation
  • (POS) Part-of-Speech Tagging
  • (NER) Named Entity Recognition

Installation

pip install -U ckip-transformers

Requirements:

NLP Tools Usage

See the documentation for API details.

1. Import module

from ckip_transformers.nlp import CkipWordSegmenter, CkipPosTagger, CkipNerChunker

2. Load models

We provide several pretrained models for the NLP tools.
# Initialize drivers
ws_driver  = CkipWordSegmenter(model="bert-base")
pos_driver = CkipPosTagger(model="bert-base")
ner_driver = CkipNerChunker(model="bert-base")
You may also load your own checkpoints with our drivers.
# Initialize drivers with custom checkpoints
ws_driver  = CkipWordSegmenter(model_name="path_to_your_model")
pos_driver = CkipPosTagger(model_name="path_to_your_model")
ner_driver = CkipNerChunker(model_name="path_to_your_model")
To use a GPU, specify the device ID when initializing the drivers. Set it to -1 (the default) to disable the GPU.
# Use CPU
ws_driver = CkipWordSegmenter(device=-1)

# Use GPU:0
ws_driver = CkipWordSegmenter(device=0)

3. Run pipeline

The input for word segmentation and named entity recognition must be a list of sentences.
The input for part-of-speech tagging must be a list of lists of words (the output of word segmentation).
# Input text
text = [
   "傅達仁今將執行安樂死,卻突然爆出自己20年前遭緯來體育台封殺,他不懂自己哪裡得罪到電視台。",
   "美國參議院針對今天總統布什所提名的勞工部長趙小蘭展開認可聽證會,預料她將會很順利通過參議院支持,成為該國有史以來第一位的華裔女性內閣成員。",
   "空白 也是可以的~",
]

# Run pipeline
ws  = ws_driver(text)
pos = pos_driver(ws)
ner = ner_driver(text)
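A common mistake is passing a single string where a list of sentences is expected. The sketch below shows one way to check input shapes before calling the drivers. The helpers `is_sentence_list` and `is_word_lists` are hypothetical and not part of ckip-transformers, and the sample `ws` value only mimics the shape of a word segmentation result.

```python
def is_sentence_list(obj):
    """True if obj is a list of strings (the shape expected for WS and NER input)."""
    return isinstance(obj, list) and all(isinstance(s, str) for s in obj)


def is_word_lists(obj):
    """True if obj is a list of lists of strings (the shape expected for POS input)."""
    return isinstance(obj, list) and all(is_sentence_list(words) for words in obj)


text = ["空白 也是可以的~"]
# Hand-written stand-in with the same shape as a word segmentation result.
ws = [["空白", " ", "也", "是", "可以", "的", "~"]]

print(is_sentence_list(text))               # True
print(is_word_lists(ws))                    # True
print(is_sentence_list("空白 也是可以的~"))  # False: a bare string, not a list
```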
The POS driver automatically splits each sentence internally on the characters ',,。::;;!!??' before running the model; the outputs are concatenated back into the original sentences. You may set delim_set to any characters you want.
You may set use_delim=False to disable this feature, or set use_delim=True in the WS and NER drivers to enable it.
# Enable sentence segmentation
ws  = ws_driver(text, use_delim=True)
ner = ner_driver(text, use_delim=True)

# Disable sentence segmentation
pos = pos_driver(ws, use_delim=False)

# Use new line characters and tabs for sentence segmentation
pos = pos_driver(ws, delim_set='\n\t')
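To make the split-then-concatenate idea concrete, here is a minimal sketch that works on plain character strings. It is only an illustration under stated assumptions, not the library's internal implementation: the helpers `split_by_delims` and `process_with_delims` are hypothetical, and the real POS driver operates on word lists rather than raw strings.

```python
DEFAULT_DELIMS = ",,。::;;!!??"


def split_by_delims(sentence, delim_set=DEFAULT_DELIMS):
    """Split a sentence into chunks, keeping each delimiter at the end of its chunk."""
    chunks, current = [], []
    for ch in sentence:
        current.append(ch)
        if ch in delim_set:
            chunks.append("".join(current))
            current = []
    if current:
        chunks.append("".join(current))
    return chunks


def process_with_delims(sentence, process_chunk, delim_set=DEFAULT_DELIMS):
    """Run process_chunk on each chunk and concatenate the per-chunk outputs."""
    result = []
    for chunk in split_by_delims(sentence, delim_set):
        result.extend(process_chunk(chunk))
    return result


sentence = "他不懂,自己哪裡得罪到電視台。"
chunks = split_by_delims(sentence)
print(chunks)  # ['他不懂,', '自己哪裡得罪到電視台。']

# Stand-in for a model: label every character of a chunk with a dummy tag.
labels = process_with_delims(sentence, lambda chunk: ["X"] * len(chunk))
print(len(labels) == len(sentence))  # True
```

Splitting on punctuation keeps each model input short, while concatenating the per-chunk outputs means the caller still receives one result per original sentence.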
You may specify batch_size and max_length to make better use of your machine resources.
# Sets the batch size and maximum sentence length
ws = ws_driver(text, batch_size=256, max_length=128)
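As a rough mental model, max_length bounds how long each model input may be and batch_size bounds how many inputs are processed together. The sketch below shows one plausible way such limits could be applied. The helpers `split_to_max_length` and `make_batches` are hypothetical, and the library's actual handling of over-long sentences may differ.

```python
def split_to_max_length(sentence, max_length):
    """Cut a sentence into consecutive pieces of at most max_length characters."""
    return [sentence[i:i + max_length] for i in range(0, len(sentence), max_length)]


def make_batches(items, batch_size):
    """Group items into consecutive batches of at most batch_size elements."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


# Dummy sentences of 300, 50, and 128 characters.
sentences = ["a" * 300, "b" * 50, "c" * 128]
pieces = [p for s in sentences for p in split_to_max_length(s, max_length=128)]
batches = make_batches(pieces, batch_size=2)

print([len(p) for p in pieces])   # [128, 128, 44, 50, 128]
print([len(b) for b in batches])  # [2, 2, 1]
```

Larger batches generally improve throughput at the cost of memory, so these two values are worth tuning together for the available CPU or GPU.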

4. Show results

# Pack word segmentation and part-of-speech results
def pack_ws_pos_sentence(sentence_ws, sentence_pos):
   assert len(sentence_ws) == len(sentence_pos)
   res = []
   for word_ws, word_pos in zip(sentence_ws, sentence_pos):
      res.append(f"{word_ws}({word_pos})")
   return "\u3000".join(res)

# Show results
for sentence, sentence_ws, sentence_pos, sentence_ner in zip(text, ws, pos, ner):
   print(sentence)
   print(pack_ws_pos_sentence(sentence_ws, sentence_pos))
   for entity in sentence_ner:
      print(entity)
   print()
傅達仁今將執行安樂死,卻突然爆出自己20年前遭緯來體育台封殺,他不懂自己哪裡得罪到電視台。
傅達仁(Nb) 今(Nd) 將(D) 執行(VC) 安樂死(Na) ,(COMMACATEGORY) 卻(D) 突然(D) 爆出(VJ) 自己(Nh) 20(Neu) 年(Nd) 前(Ng) 遭(P) 緯來(Nb) 體育台(Na) 封殺(VC) ,(COMMACATEGORY) 他(Nh) 不(D) 懂(VK) 自己(Nh) 哪裡(Ncd) 得罪到(VC) 電視台(Nc) 。(PERIODCATEGORY)
NerToken(word='傅達仁', ner='PERSON', idx=(0, 3))
NerToken(word='20年', ner='DATE', idx=(18, 21))
NerToken(word='緯來體育台', ner='ORG', idx=(23, 28))

美國參議院針對今天總統布什所提名的勞工部長趙小蘭展開認可聽證會,預料她將會很順利通過參議院支持,成為該國有史以來第一位的華裔女性內閣成員。
美國(Nc) 參議院(Nc) 針對(P) 今天(Nd) 總統(Na) 布什(Nb) 所(D) 提名(VC) 的(DE) 勞工部長(Na) 趙小蘭(Nb) 展開(VC) 認可(VC) 聽證會(Na) ,(COMMACATEGORY) 預料(VE) 她(Nh) 將(D) 會(D) 很(Dfa) 順利(VH) 通過(VC) 參議院(Nc) 支持(VC) ,(COMMACATEGORY) 成為(VG) 該(Nes) 國(Nc) 有史以來(D) 第一(Neu) 位(Nf) 的(DE) 華裔(Na) 女性(Na) 內閣(Na) 成員(Na) 。(PERIODCATEGORY)
NerToken(word='美國參議院', ner='ORG', idx=(0, 5))
NerToken(word='今天', ner='LOC', idx=(7, 9))
NerToken(word='布什', ner='PERSON', idx=(11, 13))
NerToken(word='勞工部長', ner='ORG', idx=(17, 21))
NerToken(word='趙小蘭', ner='PERSON', idx=(21, 24))
NerToken(word='認可聽證會', ner='EVENT', idx=(26, 31))
NerToken(word='參議院', ner='ORG', idx=(42, 45))
NerToken(word='第一', ner='ORDINAL', idx=(56, 58))
NerToken(word='華裔', ner='NORP', idx=(60, 62))

空白 也是可以的~
空白(VH)  (WHITESPACE) 也(D) 是(SHI) 可以(VH) 的(T) ~(FW)
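In the printed output above, each entity carries an idx pair that behaves like a half-open (start, end) character span into the input sentence. The sketch below checks this on the first example and groups entities by type. The `NerToken` class defined here is a stand-in that mirrors only the fields visible in the output; it is an assumption for illustration, not the library's own class.

```python
from collections import defaultdict
from typing import NamedTuple, Tuple


class NerToken(NamedTuple):
    """Stand-in mirroring the fields shown in the printed output."""
    word: str
    ner: str
    idx: Tuple[int, int]


sentence = "傅達仁今將執行安樂死,卻突然爆出自己20年前遭緯來體育台封殺"
entities = [
    NerToken(word="傅達仁", ner="PERSON", idx=(0, 3)),
    NerToken(word="20年", ner="DATE", idx=(18, 21)),
    NerToken(word="緯來體育台", ner="ORG", idx=(23, 28)),
]

by_type = defaultdict(list)
for entity in entities:
    start, end = entity.idx
    # The span is half-open: sentence[start:end] recovers the entity text.
    assert sentence[start:end] == entity.word
    by_type[entity.ner].append(entity.word)

print(dict(by_type))
# {'PERSON': ['傅達仁'], 'DATE': ['20年'], 'ORG': ['緯來體育台']}
```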

NLP Tools Performance

The following is a performance comparison between our tools and other tools.

CKIP Transformers vs. Monpa & Jieba

| Tool | WS (F1) | POS (Acc) | WS+POS (F1) | NER (F1) |
|---|---|---|---|---|
| CKIP BERT Base | 97.60% | 95.67% | 94.19% | 81.18% |
| CKIP ALBERT Base | 97.33% | 95.30% | 93.52% | 79.47% |
| CKIP BERT Tiny | 96.98% | 95.08% | 93.13% | 74.20% |
| CKIP ALBERT Tiny | 96.66% | 94.48% | 92.25% | 71.17% |
| Monpa† | 92.58% | | 83.88% | |
| Jieba | 81.18% | | | |

† Monpa provides only three types of tags in NER.
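This README does not define how the WS (F1) and WS+POS (F1) scores are computed. A common convention, assumed here, is to treat each word as a character span and count a predicted word as correct only if its span (and, for the joint score, its tag) exactly matches the gold standard. The helpers `words_to_spans` and `span_f1` and the toy sentence are illustrative only.

```python
def words_to_spans(words, tags=None):
    """Convert a word list into a set of (start, end[, tag]) character spans."""
    spans, start = set(), 0
    for i, word in enumerate(words):
        end = start + len(word)
        spans.add((start, end) if tags is None else (start, end, tags[i]))
        start = end
    return spans


def span_f1(gold_spans, pred_spans):
    """F1 score over exact span matches."""
    correct = len(gold_spans & pred_spans)
    if correct == 0:
        return 0.0
    precision = correct / len(pred_spans)
    recall = correct / len(gold_spans)
    return 2 * precision * recall / (precision + recall)


gold = ["今天", "天氣", "很", "好"]
pred = ["今天", "天", "氣", "很", "好"]
gold_tags = ["Nd", "Na", "Dfa", "VH"]
pred_tags = ["Nd", "Na", "Na", "D", "VH"]

ws_f1 = span_f1(words_to_spans(gold), words_to_spans(pred))
joint_f1 = span_f1(words_to_spans(gold, gold_tags), words_to_spans(pred, pred_tags))
print(round(ws_f1, 4))     # 0.6667
print(round(joint_f1, 4))  # 0.4444
```

Under this definition the joint score can never exceed the segmentation score, which is consistent with the WS+POS column being lower than the WS column in the table above.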

CKIP Transformers vs. CkipTagger

The following results were obtained on a different dataset.†

| Tool | WS (F1) | POS (Acc) | WS+POS (F1) | NER (F1) |
|---|---|---|---|---|
| CKIP BERT Base | 97.84% | 96.46% | 94.91% | 79.20% |
| CkipTagger | 97.33% | 97.20% | 94.75% | 77.87% |

† Here we retrained and tested our BERT model on the same dataset as CkipTagger.

License

GPL-3.0

Copyright (c) 2023 CKIP Lab under the GPL-3.0 License.