Summary of Text Classification Datasets

    Tech, 2022-07-11

    The table below summarizes the text classification datasets I was able to download (as of 2020-07-01):

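    | Dataset | Classes | Type | Samples | Best Method | Performance |
    |---|---|---|---|---|---|
    | AG News | 4 | Topic | Train: 120,000; Test: 7,600 | XLNet | Error: 4.45 |
    | Dbpedia | 14 | Topic | Train: 560,000; Test: 70,000 | XLNet | Error: 0.6 |
    | TREC-6 | 6 | Question | Train: 5,452; Test: 500 | USE_T+CNN | Error: 1.93 |
    | TREC-50 | 50 | Question | Train: 5,452; Test: 500 | Rules | Error: 2.8 |
    | 20NEWS | 20 | Topic | 20,000 | SGC | Acc: 88.5 |
    | IMDb | 2 | Sentiment | Train: 25,000; Test: 25,000 | XLNet | Acc: 96.8 |
    | Yahoo! Answers | 10 | Question | Train: 1,400,000; Test: 60,000 | BERT-ITPT-FiT | Acc: 77.62 |
    | R8 | 8 | Topic | Train: 5,485; Test: 2,189 | NABoE-full | Acc: 97.9 |
    | Ohsumed | 23 | Disease classification | 50,216 | SGCN | Acc: 68.5 |
    | Sogou News | 5 | Topic | Train: 450,000; Test: 60,000 | BERT-ITPT-FiT | Acc: 98.07 |
    | Amazon-2 | 2 | Rating (1-2: negative, 4-5: positive) | Per class: Train 1,800,000; Test 200,000 | XLNet | Error: 2.11 |
    | Amazon-5 | 5 | User rating (1-5) | Per class: Train 600,000; Test 130,000 | XLNet | Error: 31.67 |
    | Yelp-2 | 2 | Rating (1-2: negative, 4-5: positive) | Per class: Train 130,000; Test 10,000 | XLNet | Acc: 98.63 |
    | Yelp-5 | 5 | User rating (1-5) | Per class: Train 130,000; Test 10,000 | HANNN | Acc: 73.28 |
    | Reuters-21578 | 90 | Topic | Train: 7,769; Test: 3,019 | MPAD-path | Acc: 97.44 |
    | Cora | 7 | Paper category (e.g. genetic algorithms) | 2,708 | ACNet | Acc: 83.5 |
    | BBCSports | 5 | Topic | 737 | MPAD-path | Acc: 99.59 |
    | WOS-11967 | 35 (7 parent classes) | Paper category (e.g. CS -> computer graphics) | 11,967 | RMDL | Acc: 91.59 |
    | WOS-46985 | 134 (7 parent classes) | Paper category (e.g. CS -> computer graphics) | 46,985 | RMDL | Acc: 82.42 |
    | WOS-5736 | 11 (3 parent classes) | Paper category (e.g. CS -> computer graphics) | 5,736 | RMDL | Acc: 93.57 |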
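
The Performance column mixes two metrics, Error and Acc, which makes rows hard to compare at a glance. A minimal sketch for putting them on one scale follows. It assumes that Error is an error rate in percent, so that accuracy = 100 - error; the source does not state this, and the helper name `to_accuracy` is hypothetical.

```python
import re


def to_accuracy(performance: str) -> float:
    """Convert an 'Error: x' or 'Acc: x' string to accuracy in percent."""
    match = re.fullmatch(
        r"\s*(Error|Acc)\s*:\s*([0-9]+(?:\.[0-9]+)?)\s*", performance
    )
    if match is None:
        raise ValueError(f"unrecognized performance string: {performance!r}")
    metric, value = match.group(1), float(match.group(2))
    if metric == "Error":
        # Assumption: Error is a percentage error rate, so accuracy = 100 - error.
        return round(100.0 - value, 2)
    return value


if __name__ == "__main__":
    # A few rows copied from the table above.
    rows = [("AG News", "Error: 4.45"), ("IMDb", "Acc: 96.8"), ("Yelp-2", "Acc:98.63")]
    for name, perf in rows:
        print(name, to_accuracy(perf))
```

Note that the converted numbers are still only comparable within a dataset, since each row reports a different benchmark.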

    Datasets I could not download: DODF Data, MVICTOR(type), RCV1, TRAC2-Benghali Task 2, TRAC2-English Task 2, AffCon 2020 Emotion Detection, IMDb-M, AAPD, Yelp-14, Reuters En-De, Reuters De-En, MPQA, HoC

    Reference links: Text Classification, Document Classification
