Spooky author identification

Let's practice with Kaggle's Spooky Author Identification competition.

Spooky Author Identification

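  • Predict the author by analyzing the words in sentences taken from horror stories
  • Submission: id + a probability for each of the 3 authors => text classification with 3 classes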
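As a rough illustration of the submission format described above, the sketch below builds a small table with one id column and one probability column per author. The ids and probabilities are made up for the example; the column names EAP, HPL and MWS are the author codes that appear in the training data.

```python
import numpy as np
import pandas as pd

# Hypothetical ids and predicted probabilities, for illustration only.
ids = ["id00001", "id00002"]
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6]])  # one column per author: EAP, HPL, MWS

submission = pd.DataFrame(probs, columns=["EAP", "HPL", "MWS"])
submission.insert(0, "id", ids)  # id goes first, followed by the three probabilities
print(submission)
```

Each row should sum to 1, since the three columns are probabilities over the same three classes.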
import pandas as pd
import numpy as np

1. Loading the data

id text author
0 id26305 This process, however, afforded me no means of... EAP
1 id17569 It never once occurred to me that the fumbling... HPL
2 id11008 In his left hand was a gold snuff box, from wh... EAP
3 id27763 How lovely is spring As we looked from Windsor... MWS
4 id12958 Finding nothing else, not even gold, the Super... HPL
(19579, 3)
(8392, 2)
id text
0 id02310 Still, as I urged our leaving Ireland with suc...
1 id24541 If a fire wanted fanning, it could readily be ...
2 id00134 And when they had broken down the frail door t...
3 id27757 While I was thinking how I should possibly man...
4 id04081 I am not sure to what limit his knowledge may ...
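The loading step that produced the tables above can be sketched roughly as follows. The file name train.csv is an assumption, and the example reads a tiny in-memory stand-in (with shortened, made-up rows) so that it runs without the competition files.

```python
import io
import pandas as pd

# In-memory stand-in for the training file; the rows are shortened examples.
train_csv = io.StringIO(
    'id,text,author\n'
    'id26305,"This process, however, afforded me no means",EAP\n'
    'id17569,"It never once occurred to me",HPL\n'
)

# With the real data this would be something like pd.read_csv("train.csv")
# (assumed file name).
train_demo = pd.read_csv(train_csv)
print(train_demo.head())   # first rows
print(train_demo.shape)    # (number of rows, number of columns)
```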


import matplotlib.pyplot as plt
from wordcloud import WordCloud
  • Words for each author


  • Checking sentence length (in characters)
data_length = train.text.apply(len)
0    231
1     71
2    200
3    206
4    174
Name: text, dtype: int64
plt.hist(data_length, bins= 20, range=[0,500], color="r", alpha=0.3)


  • Roughly how many words each sentence contains
data_split_length = train.text.apply(lambda x: len(x.split(" ")))
0    41
1    14
2    36
3    34
4    27
Name: text, dtype: int64
plt.hist(data_split_length, bins=10, range=[0,100], color='b', alpha=0.5)
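The two length measures can be compared on a small made-up sample. One detail worth noting is that split(" ") counts an empty string for every doubled space, while split() with no argument does not. The sample sentences below are invented for the example.

```python
import pandas as pd

sample = pd.DataFrame({"text": ["It was  a dark night.", "He ran."]})

char_len = sample.text.str.len()                              # characters per sentence
words_space = sample.text.apply(lambda x: len(x.split(" ")))  # split on single spaces
words_any = sample.text.str.split().str.len()                 # split on any whitespace run

print(pd.DataFrame({"chars": char_len,
                    "split(' ')": words_space,
                    "split()": words_any}))
```

The first sentence contains a doubled space, so the two word counts differ by one for that row.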


  • Word cloud: words used most often across the whole text
cloud= WordCloud(width=400, height=200).generate(" ".join(train.text))


  • Word clouds: words used most often by each author


cloud= WordCloud(width=400, height=200).generate(" ".join(train[train['author']=='HPL']['text']))



cloud= WordCloud(width=400, height=200).generate(" ".join(train[train['author']=='MWS']['text']))



cloud= WordCloud(width=400, height=200).generate(" ".join(train[train['author']=='EAP']['text']))


2. Data preprocessing

  • Convert the author names to 0, 1, 2
from sklearn import preprocessing
from keras.preprocessing import  sequence, text
lbl_enc = preprocessing.LabelEncoder()
y = lbl_enc.fit_transform(train.author)
array([0, 1, 0, 2, 1, 2, 0, 0, 0, 2])
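To see which author maps to which integer, a small sketch with LabelEncoder on a made-up list of author codes is shown below. LabelEncoder sorts the class labels, so the codes follow alphabetical order.

```python
from sklearn.preprocessing import LabelEncoder

authors = ["EAP", "HPL", "EAP", "MWS"]  # made-up sample of author codes

enc = LabelEncoder()
codes = enc.fit_transform(authors)

# classes_ holds the sorted labels; the position of each label is its integer code.
mapping = {label: code for code, label in enumerate(enc.classes_)}
print(codes)
print(mapping)

# inverse_transform turns integer codes back into author names.
print(enc.inverse_transform([2]))
```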

3. Splitting the dataset

from sklearn.model_selection import train_test_split

x_train, x_valid, y_train, y_valid = train_test_split(train.text.values, y, stratify=y, random_state=42, test_size=0.3, shuffle=True)
print( x_train.shape)
print( x_valid.shape)

print( y_train.shape)
print( y_valid.shape)
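The effect of stratify=y can be checked on toy data: the class proportions in the training and validation parts match those of the full label array. The texts and label counts below are invented for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 40 samples with class counts 20 / 10 / 10.
texts = np.array([f"sentence {i}" for i in range(40)])
labels = np.array([0] * 20 + [1] * 10 + [2] * 10)

x_tr, x_va, y_tr, y_va = train_test_split(
    texts, labels, stratify=labels, random_state=42, test_size=0.3, shuffle=True
)

# Per-class counts in each part; both keep the 2 : 1 : 1 ratio of the full data.
print(np.bincount(y_tr), np.bincount(y_va))
```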
  • One-hot encoding
from keras.utils import to_categorical

ytrain_enc = to_categorical(y_train)
yvalid_enc = to_categorical(y_valid)
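For intuition about what the one-hot step produces, here is a NumPy-only sketch on a made-up label array. It is an equivalent illustration using np.eye, not the Keras function itself.

```python
import numpy as np

y_example = np.array([0, 1, 0, 2])  # made-up integer labels for 3 classes

# Row i of the identity matrix is the one-hot vector for class i,
# so indexing with the labels picks one row per sample.
one_hot = np.eye(3, dtype="float32")[y_example]
print(one_hot)

# argmax along each row recovers the original integer labels.
print(one_hot.argmax(axis=1))
```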

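4. Building a model with Keras

  • Use 5,000 words
  • Maximum length of 60
  • Use padding to make all sequences the same length
  • Use the texts_to_sequences() method to convert the words into sequences (each sentence is rewritten as the list of word_index numbers of its words)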
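from keras.preprocessing.text import Tokenizer

num_words = 5000  # vocabulary size
max_len = 60      # maximum sequence length
token = Tokenizer(num_words=num_words)  # optionally, oov_token maps out-of-vocabulary words to a special token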
token.fit_on_texts(list(x_train) + list(x_valid))
word_index = token.word_index

xtrain_seq = token.texts_to_sequences(x_train)
xvalid_seq = token.texts_to_sequences(x_valid)
test_seq = token.texts_to_sequences(test.text.values)

# zero pad the sequences

xtrain_pad = sequence.pad_sequences(xtrain_seq, maxlen=max_len)
xvalid_pad = sequence.pad_sequences(xvalid_seq, maxlen=max_len)
test_pad = sequence.pad_sequences(test_seq , maxlen=max_len)
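What the tokenizing and padding steps do can be sketched in plain Python and NumPy. The helpers build_word_index, to_sequences and pad_pre are hypothetical names written for this illustration; they are simplified stand-ins, not the Keras implementation. The sketch assumes the Keras defaults of padding and truncating at the front of each sequence, and it filters the word index directly, whereas Keras keeps every word in word_index and applies the num_words limit when converting texts.

```python
from collections import Counter
import numpy as np

def build_word_index(texts, num_words):
    """Rank words by frequency; index 1 is the most frequent, 0 is reserved for padding."""
    counts = Counter(w for t in texts for w in t.lower().split())
    ranked = [w for w, _ in counts.most_common()]
    # Only indices below num_words are kept.
    return {w: i + 1 for i, w in enumerate(ranked) if i + 1 < num_words}

def to_sequences(texts, word_index):
    """Replace each known word with its index; unknown words are dropped."""
    return [[word_index[w] for w in t.lower().split() if w in word_index] for t in texts]

def pad_pre(seqs, maxlen):
    """Left-pad with zeros and keep only the last maxlen tokens of long sequences."""
    out = np.zeros((len(seqs), maxlen), dtype="int32")
    for row, seq in enumerate(seqs):
        trimmed = seq[-maxlen:]
        if trimmed:
            out[row, -len(trimmed):] = trimmed
    return out

texts = ["the cat sat", "the dog sat on the mat"]
word_index = build_word_index(texts, num_words=10)
seqs = to_sequences(texts, word_index)
padded = pad_pre(seqs, maxlen=4)
print(word_index)
print(padded)
```

The short sentence gets a leading zero, and the long one loses its first two tokens.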
  • Specifying the stopwords directly
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer(stop_words=["and", "is", "the", "this", 'υπνος','οἶδα']).fit(list(x_train) + list(x_valid))
Loading stopwords

import nltk
from nltk import word_tokenize
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\uos\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]
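  • Remove words that appear in the English stopword list (stop_words="english" uses scikit-learn's built-in list, not the NLTK list downloaded above)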
vect = CountVectorizer(stop_words="english").fit((list(x_train) + list(x_valid)))
{'came': 2992,
 'embodied': 7215,
 'image': 10905,
 'fondest': 8817,
 'dreams': 6688,
 'creator': 4952,
 'abhor': 32,
 'hope': 10581,
 'gather': 9347,
 'fellow': 8409,
 'creatures': 4954,
 'owe': 15372,
 'remember': 18062,
 'trees': 22437,
 'benches': 2059,
 'similar': 19859,
 'especially': 7676,
 'way': 24106,
 'method': 13747,
 'true': 22563,
 'wretchedness': 24583,
 'ultimate': 22721,
 'woe': 24461,
 'particular': 15610,
 'diffuse': 6036,
 'barzai': 1835,
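The difference between a hand-written stopword list and stop_words="english" shows up clearly on a tiny made-up corpus. The two sentences below are invented for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["The raven is on the bust", "This is the tomb"]  # made-up sentences

# Only the words listed explicitly are removed.
custom = CountVectorizer(stop_words=["and", "is", "the", "this"]).fit(docs)

# scikit-learn's built-in English list removes more function words, such as "on".
builtin = CountVectorizer(stop_words="english").fit(docs)

print(sorted(custom.vocabulary_))
print(sorted(builtin.vocabulary_))
```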

Other CountVectorizer() options


Tokenizing by character

Using a custom pattern
Using a word tokenizer

  • (2,2): treat 2 consecutive tokens as one term
  • (1,2): treat 1 or 2 consecutive tokens as one term
vect = CountVectorizer(ngram_range=(2, 2)).fit((list(x_train) + list(x_valid)))
{'you came': 222734,
 'came the': 31603,
 'the embodied': 182073,
 'embodied image': 54545,
 'image of': 89224,
 'of my': 129305,
 'my fondest': 118572,
 'fondest dreams': 66094,
 'you my': 222939,
 'my creator': 118311,
 'creator abhor': 41641,
 'abhor me': 87,
 'me what': 111278,
 'what hope': 212948,
 'hope can': 86493,
 'can gather': 31737,
 'gather from': 70913,
 'from your': 69797,
 'your fellow': 223399,
 'fellow creatures': 63004,
 'creatures who': 41740,
 'who owe': 215732,
 'owe me': 137174,
 'me nothing': 111038,
 'well remember': 211513,
 'remember it': 151695,
 'it had': 97057,
 'had no': 76413,
 'no trees': 122967,
 'trees nor': 199145,
 'nor benches': 123356,
 'benches nor': 24641,
 'nor anything': 123348,
 'anything similar': 14255,
 'similar within': 163511,
 'within it': 219544,
 'especially there': 56651,
 'there is': 189467,
 'is nothing': 96243,
 'nothing to': 124814,
 'to be': 195208,
 'be made': 21429,
 'made in': 107999,
 'in this': 91822,
 'this way': 192345,
 'way without': 210439,
 'without method': 219710,
 'the true': 187001,
 'true wretchedness': 199706,
 'wretchedness indeed': 221562,
 'indeed the': 92638,
 'the ultimate': 187052,
 'ultimate woe': 200747,
 'woe is': 219897,
 'is particular': 96288,
 'particular not': 138617,
 'not diffuse': 123881,
 'barzai and': 20705,
 'and atal': 7791,
 'atal went': 18698,
 'went out': 211613,
 'out of': 136563,
 'of hatheg': 128520,
 'hatheg into': 78352,
 'into the': 95154,
 'the stony': 186443,
 'stony desert': 171682,
 'desert despite': 46534,
 'despite the': 46942,
 'the prayers': 185134,
 'prayers of': 144357,
 'of peasants': 129560,
 'peasants and': 139667,
 'and talked': 12274,
 'talked of': 176612,
 'of earth': 127961,
 'earth gods': 53036,
 'gods by': 73111,
 'by their': 30933,
 'their campfires': 187827,
 'campfires at': 31656,
 'at night': 18503,
 'from my': 69519,
 'my infancy': 118754,
 'infancy was': 93107,
 'was noted': 209080,
 'noted for': 124696,
 'for the': 66873,
 'the docility': 181845,
 'docility and': 50064,
 'and humanity': 9899,
 'humanity of': 88021,
 'my disposition': 118394,
 'then the': 189244,
 'the bank': 180387,
 'bank defaulter': 20434,
 'defaulter remembered': 45284,
 'remembered the': 151748,
 'the picture': 184941,
 'picture and': 141607,
 'and suggested': 12192,
 'suggested that': 174368,
 'that it': 179046,
 'it be': 96813,
 'be viewed': 21824,
 'viewed and': 206079,
 'and filed': 9348,
 'filed for': 63892,
 'for identification': 66544,
 'identification at': 88583,
 'at police': 18543,
 'police headquarters': 143234,
 'saw that': 157093,
 'that the': 179706,
 'the moat': 184209,
 'moat was': 114207,
 'was filled': 208589,
 'filled in': 63924,
 'in and': 90244,
 'and that': 12332,
 'that some': 179596,
 'some of': 167245,
 'of the': 130574,
 'the well': 187474,
 'well known': 211486,
 'known towers': 100452,
 'towers were': 198408,
 'were demolished': 211867,
 'demolished whilst': 45939,
 'whilst new': 215189,
 'new wings': 121838,
 'wings existed': 217810,
 'existed to': 59093,
 'to confuse': 195406,
 'confuse the': 38666,
 'the beholder': 180472,
 'by this': 30939,
 'this time': 192272,
 'time his': 194613,
 'his pulse': 85462,
 'pulse was': 147071,
 'was imperceptible': 208797,
 'imperceptible and': 89761,
 'and his': 9854,
 'his breathing': 84399,
 'breathing was': 27984,
 'was stertorous': 209543,
 'stertorous and': 171106,
 'and at': 7790,
 'at intervals': 18431,
 'intervals of': 94811,
 'of half': 128499,
 'half minute': 77076,
 'and above': 7486,
 'above the': 429,
 'the nighted': 184449,
 'nighted screaming': 122215,
 'screaming of': 157955,
 'of men': 129135,
 'men and': 111993,
 'and horses': 9880,
 'horses that': 86903,
 'that dæmonic': 178722,
 'dæmonic drumming': 52549,
 'drumming rose': 51962,
 'rose to': 155024,
 'to louder': 196218,
 'louder pitch': 106993,
 'pitch whilst': 141969,
 'whilst an': 215165,
 'an ice': 6815,
 'ice cold': 88435,
 'cold wind': 36762,
 'wind of': 217510,
 'of shocking': 130206,
 'shocking sentience': 161962,
 'sentience and': 159895,
 'and deliberateness': 8672,
 'deliberateness swept': 45620,
 'swept down': 175841,
 'down from': 50873,
 'from those': 69719,
 'those forbidden': 192565,
 'forbidden heights': 67034,
 'heights and': 81370,
 'and coiled': 8311,
 'coiled about': 36670,
 'about each': 247,
 'each man': 52620,
 'man separately': 109004,
 'separately till': 159967,
 'till all': 194422,
 'all the': 4570,
 'the cohort': 181121,
 'cohort was': 36667,
 'was struggling': 209566,
 'struggling and': 172963,
 'and screaming': 11684,
 'screaming in': 157950,
 'in the': 91807,
 'the dark': 181542,
 'dark as': 43342,
 'as if': 16990,
 'if acting': 88736,
 'acting out': 1348,
 'out the': 136594,
 'the fate': 182385,
 'fate of': 61902,
 'of laocoön': 128899,
 'laocoön and': 101140,
 'his sons': 85661,
 'the arms': 180223,
 'arms stirred': 16063,
 'stirred disquietingly': 171522,
 'disquietingly the': 49332,
 'the legs': 183742,
 'legs drew': 102946,
 'drew up': 51678,
 'up and': 202974,
 'and various': 12723,
 'various muscles': 204780,
 'muscles contracted': 117574,
 'contracted in': 39746,
 'in repulsive': 91516,
 'repulsive kind': 152447,
 'kind of': 99809,
 'of writhing': 131041,
 'for arthur': 66275,
 'arthur munroe': 16465,
 'munroe was': 117471,
 'was dead': 208337,
 'alas how': 3864,
 'how great': 87426,
 'great was': 74442,
 'was the': 209648,
 'the contrast': 181292,
 'contrast between': 39794,
 'between us': 25278,
 'us he': 203683,
 'he was': 80266,
 'was alive': 207990,
 'alive to': 4057,
 'to every': 195784,
 'every new': 57812,
 'new scene': 121792,
 'scene joyful': 157569,
 'joyful when': 99020,
 'when he': 213455,
 'he saw': 80089,
 'saw the': 157094,
 'the beauties': 180453,
 'beauties of': 22121,
 'the setting': 185963,
 'setting sun': 160294,
 'sun and': 174588,
 'and more': 10614,
 'more happy': 115396,
 'happy when': 77716,
 'he beheld': 79398,
 'beheld it': 23924,
 'it rise': 97313,
 'rise and': 154169,
 'and recommence': 11396,
 'recommence new': 150395,
 'new day': 121712,
 'this adventure': 191273,
 'adventure occurred': 2154,
 'occurred near': 126559,
 'near richmond': 120672,
 'richmond in': 153844,
 'in virginia': 91948,
 'grey headed': 74927,
 'headed men': 80443,
 'men ye': 112141,
 'ye hoped': 221974,
 'hoped for': 86547,
 'for yet': 66996,
 'yet few': 222391,
 'few years': 63578,
 'years in': 222115,
 'in your': 92047,
 'your long': 223507,
 'long known': 106174,
 'known abode': 100384,
 'abode but': 176,
 'but the': 29984,
 'the lease': 183722,
 'lease is': 102400,
 'is up': 96559,
 'up you': 203162,
 'you must': 222937,
 'must remove': 117854,
 'remove children': 151881,
 'children ye': 34885,
 'ye will': 222018,
 'will never': 217224,
 'never reach': 121566,
 'reach maturity': 149423,
 'maturity even': 110455,
 'even now': 57154,
 'now the': 125486,
 'the small': 186172,
 'small grave': 165269,
 'grave is': 74067,
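A one-sentence example makes the two ngram_range settings easier to compare than the full vocabulary. The sentence is pieced together from the first entries of the output above.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["you came the embodied image"]

bigrams = CountVectorizer(ngram_range=(2, 2)).fit(docs)  # pairs of tokens only
uni_bi = CountVectorizer(ngram_range=(1, 2)).fit(docs)   # single tokens and pairs

print(sorted(bigrams.vocabulary_))
print(len(bigrams.vocabulary_), len(uni_bi.vocabulary_))
```

Five tokens give four bigrams, and the (1, 2) setting adds the five single tokens on top of them.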


  • Tokens whose document frequency is above max_df or below min_df are ignored (integer values are absolute document counts)
vect = CountVectorizer(max_df=5000, min_df=10).fit((list(x_train) + list(x_valid)))
vect.vocabulary_, vect.stop_words_
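The filtering can be verified on a toy corpus with much smaller thresholds than the ones used above. The four documents and the thresholds are made up for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the raven", "the tomb", "the raven flew", "the night"]

# Document frequencies: the=4, raven=2, tomb=1, flew=1, night=1.
# max_df=3 drops "the" (in more than 3 documents);
# min_df=2 drops the words that occur in fewer than 2 documents.
vect_demo = CountVectorizer(max_df=3, min_df=2).fit(docs)
print(vect_demo.vocabulary_)
```

Passing floats instead of integers makes the thresholds proportions of the corpus, for example max_df=0.75 for "at most 75% of the documents".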
TF-IDF (Term Frequency - Inverse Document Frequency)

  • TF-IDF encoding does not use raw word counts: words that appear in nearly every document are considered poor at telling documents apart, so their weight is reduced
from sklearn.feature_extraction.text import TfidfVectorizer
tfv = TfidfVectorizer(min_df=3, max_features=None, strip_accents='unicode', analyzer='word',
                      token_pattern=r'\w{1,}', ngram_range=(1, 3), use_idf=True, smooth_idf=True, sublinear_tf=True)
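  • TfidfVectorizer is the function that computes these weighted word counts
  • min_df: minimum document frequency (DF, the number of documents containing a term) a term needs in order to be kept
  • analyzer: 'word' or 'char'
  • sublinear_tf: dampens high term frequencies (TF) by replacing tf with 1 + log(tf)
  • ngram_range: range of n-gram sizes (groups of consecutive words) to extract
  • max_features: maximum number of features in the tf-idf vector; with None, every vocabulary entry gets its own index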
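To make the weighting concrete, the sketch below recomputes the values by hand on a made-up three-document corpus and compares them with TfidfVectorizer. It assumes the formulas from the scikit-learn documentation: idf = ln((1 + n) / (1 + df)) + 1 when smooth_idf is on, tf replaced by 1 + ln(tf) when sublinear_tf is on, and L2 normalization of each row by default.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the cat sat", "the dog saw the dog", "the cat ran fast"]  # made-up corpus

# Raw term counts, one row per document and one column per vocabulary word.
counts = CountVectorizer().fit_transform(docs).toarray().astype(float)
n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)                     # documents containing each word

idf = np.log((1 + n_docs) / (1 + df)) + 1         # smooth_idf=True
tf = np.where(counts > 0,
              1 + np.log(np.maximum(counts, 1)),  # sublinear_tf=True
              0.0)
manual = tf * idf
manual /= np.linalg.norm(manual, axis=1, keepdims=True)  # L2-normalize each row

reference = TfidfVectorizer(smooth_idf=True, sublinear_tf=True).fit_transform(docs).toarray()
print(np.allclose(manual, reference))
```

The word "the" occurs in every document, so it gets the smallest idf of all words, which is the down-weighting described above.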
A_tfidf_sp = tfv.fit_transform(list(x_train) + list(x_valid)) 
tfidf_dict = tfv.get_feature_names_out()
data_array = A_tfidf_sp.toarray()
data = pd.DataFrame(data_array, columns=tfidf_dict)
(19579, 15102)
tfv.fit(list(x_train) + list(x_valid))
Most frequent words

from collections import Counter
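A minimal sketch of counting the most frequent words with Counter follows. The name tags is assumed to be a list of (word, count) pairs, which is the form dict(tags) expects; the sentences are made up for the example.

```python
from collections import Counter

texts = ["the old house", "the old man", "a dark house"]  # made-up sentences

# Flatten all sentences into one list of lowercase words.
words = [w for t in texts for w in t.lower().split()]

# most_common(n) returns the n most frequent (word, count) pairs.
tags = Counter(words).most_common(3)
print(tags)
print(dict(tags))  # word -> count mapping, the form generate_from_frequencies accepts
```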
Visualizing with a word cloud

from wordcloud import WordCloud
import matplotlib.pyplot as plt
wordcloud = WordCloud(font_path='font/NanumGothic.ttf', background_color='white')
cloud = wordcloud.generate_from_frequencies(dict(tags))

plt.figure(figsize=(10, 8))



