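Contents: NYT-Wiki Dataset Analysis
[Dataset Analysis] NYT-Wiki Relation Extraction Dataset Analysis (1): Understanding a Single Instance
[Dataset Analysis] NYT-Wiki Relation Extraction Dataset Analysis (2): Counting Classes and Instances
[Dataset Analysis] NYT-Wiki Relation Extraction Dataset Analysis (3): Plotting the Relation Distribution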
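Part 2 produced a description of the three subsets: the number of classes and the number of instances in each.
This part shows how to plot the relation distribution of the dataset. In the plot, the x-axis lists the relations and the y-axis gives the number of instances for each relation.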
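1. Viewing the data distribution
Viewing the data distribution takes three steps:
(1) Read every record of the dataset (JSON format).
(2) Build a frequency dict of the form {"class name 1": count1, "class name 2": count2, ...}.
(3) Draw the figure with matplotlib.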
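The first two steps can be sketched on their own, assuming each line of the file is one JSON object with a 'relation' field, as in the code of this post. The function name count_relations and the sample records below are made up for illustration and are not taken from the dataset.

```python
import json
from collections import Counter


def count_relations(lines):
    """Return {relation name: number of instances} for JSON-lines input."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            # Skip empty lines instead of failing in json.loads.
            continue
        record = json.loads(line)
        counts[record["relation"]] += 1
    return dict(counts)


# Hypothetical records, for illustration only.
sample_lines = [
    '{"relation": "rel_a"}',
    '{"relation": "rel_b"}',
    '{"relation": "rel_a"}',
]
print(count_relations(sample_lines))
```

Counter does the same bookkeeping as an explicit "if key not in dict" check, so either form can feed the plotting step.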
2. Code
import json

import matplotlib.pyplot as plt


def plot_relation_distribution(dataset_path):
    # Count the instances of each relation; every line is one JSON record.
    rel_fre_dict = {}
    with open(dataset_path, 'r', encoding='utf-8') as f:
        for line in f:
            record = json.loads(line)
            relation = record['relation']
            rel_fre_dict[relation] = rel_fre_dict.get(relation, 0) + 1

    # Sort the relations by instance count, largest first.
    sorted_rel_fre = sorted(rel_fre_dict.items(), key=lambda kv: -kv[1])
    x = [rel for rel, _ in sorted_rel_fre]
    y = [count for _, count in sorted_rel_fre]

    plt.figure(figsize=[40, 10])
    plt.bar(x, y, width=1, align='center', alpha=0.5, clip_on=True)
    plt.ylim([0, 5000])  # bars taller than 5000 are cut off at the top
    plt.xlabel("relation name")
    plt.ylabel("# of instances")
    plt.title(str(dataset_path) + ' relation number statistic')
    # Rotate the relation names by 90 degrees so that they do not overlap.
    plt.tick_params(axis='x', colors='red', length=13, width=3, labelrotation=90)
    plt.savefig(str(dataset_path) + '.png')
    plt.close()


# train_path, valid_path and test_path are the paths of the three subsets;
# they are assumed to be defined before these calls.
plot_relation_distribution(train_path)
plot_relation_distribution(valid_path)
plot_relation_distribution(test_path)
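Because the y-axis is fixed to the range [0, 5000], any relation with more instances than that is drawn cut off at the top of the figure. A minimal sketch for listing such relations is given below; the function name relations_above_limit and the example counts are hypothetical, and the input is assumed to be a frequency dict of the same form as rel_fre_dict.

```python
def relations_above_limit(rel_fre_dict, y_max=5000):
    """Return (relation, count) pairs whose count exceeds y_max, largest first."""
    sorted_items = sorted(rel_fre_dict.items(), key=lambda kv: -kv[1])
    return [(name, count) for name, count in sorted_items if count > y_max]


# Hypothetical counts, for illustration only.
example_counts = {"rel_a": 12000, "rel_b": 300, "rel_c": 5001}
print(relations_above_limit(example_counts))
# prints [('rel_a', 12000), ('rel_c', 5001)]
```

If the list is not empty, raising the upper limit passed to plt.ylim or reporting these counts next to the figure keeps the head of the distribution readable.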