Pandas + scikit-learn K-means not working properly: it treats all DataFrame rows as one big multi-dimensional example

I'm currently trying to do some k-means clustering on my data, which is stored in a pandas DataFrame (actually in one of its columns). The strange thing is that instead of treating each row as a separate example, it treats all rows as one example in a very high dimension. For example:

import numpy as np
import pandas as pd
from sklearn import cluster

df = pd.read_csv('D:\\Apps\\DataSciense\\Kaggle Challenges\\Titanic\\Source Data\\train.csv', header=0)

median_ages = np.zeros((2,3))

for i in range(0,2):
    for j in range(0, 3):
        median_ages[i, j] = df[(df.Gender == i) & (df.Pclass == j+1)].Age.dropna().median()

df['AgeFill'] = df['Age']

for i in range(0, 2):
    for j in range(0,3):
        df.loc[(df.Age.isnull()) & (df.Gender == i) & (df.Pclass == j+1), 'AgeFill'] = median_ages[i, j]
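
The nested loops above fill each missing age with the median of its (Gender, Pclass) group. The same idea can probably be written as a group-wise fill. This is only a sketch on a small made-up frame: the name `toy` and all its values are hypothetical stand-ins, not the Titanic data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for train.csv (values are made up).
toy = pd.DataFrame({
    "Gender": [0, 0, 0, 1, 1, 1],
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Age": [20.0, np.nan, 40.0, 10.0, 30.0, np.nan],
})

# Median age per (Gender, Pclass) group, broadcast back onto every row.
# NaN ages are skipped when the median is computed.
group_median = toy.groupby(["Gender", "Pclass"])["Age"].transform("median")

# Keep known ages and fill only the missing ones with their group's median.
toy["AgeFill"] = toy["Age"].fillna(group_median)
```

The result is still a single column (a 1D Series), just like `df.AgeFill` in the question.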

Then I just check that it looks fine:

df.AgeFill

Name: AgeFill, Length: 891, dtype: float64

It looks fine: 891 float64 numbers. Then I do the clustering:

k_means = cluster.KMeans(n_clusters=1, init='random')
k_means.fit(df.AgeFill)

And I check cluster centers:

k_means.cluster_centers_

It returns one giant array.

Furthermore,

k_means.labels_

Gives me:

array([0])

What am I doing wrong? Why does it think that I have one example with 891 dimensions, instead of 891 examples?

To illustrate it better, if I try 2 clusters:

k_means = cluster.KMeans(n_clusters=2, init='random')
k_means.fit(df.AgeFill)

Traceback (most recent call last):
  File "...", line 1, in ...
    k_means.fit(df.AgeFill)
  File "D:\Apps\Python\lib\site-packages\sklearn\cluster\k_means_.py", line 724, in fit
    X = self._check_fit_data(X)
  File "D:\Apps\Python\lib\site-packages\sklearn\cluster\k_means_.py", line 693, in _check_fit_data
    X.shape[0], self.n_clusters))
ValueError: n_samples=1 should be >= n_clusters=2

So it really does think that I have only one sample, even though the column has 891 entries:

df.AgeFill.shape
(891,)

You are passing a 1D array, while scikit-learn expects a 2D array of shape (n_samples, n_features). Try:

k_means.fit(df.AgeFill.values.reshape(-1, 1))  # .values: newer pandas has no Series.reshape

Before:

>>> df.AgeFill.shape
(891,)

After:

>>> df.AgeFill.values.reshape(-1, 1).shape
(891, 1)
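
To see the effect end to end, here is a minimal sketch that fits `KMeans` on a single reshaped column. The `ages` Series and its values are made up for illustration, and `n_init=10` and `random_state=0` are choices made here only so that the run is repeatable. They are not part of the original question.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical column of ages (made-up values, two obvious groups).
ages = pd.Series([5.0, 6.0, 7.0, 60.0, 62.0, 65.0], name="AgeFill")

# A Series is 1D: shape (6,). scikit-learn wants (n_samples, n_features).
X = ages.to_numpy().reshape(-1, 1)   # shape (6, 1): 6 samples, 1 feature

# An equivalent 2D view that stays in pandas: a one-column DataFrame.
X_frame = ages.to_frame()            # shape (6, 1)

k_means = KMeans(n_clusters=2, n_init=10, random_state=0)
k_means.fit(X)

# One label per row, and one 1-dimensional center per cluster.
labels = k_means.labels_             # shape (6,)
centers = k_means.cluster_centers_   # shape (2, 1)
```

With the 2D input there is one label per row, instead of the single `array([0])` from the question.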