تنفيذ K أقرب الجيران

المتطلب السابق: ك أقرب الجيران

مقدمة

لنفترض أننا حصلنا على مجموعة بيانات من العناصر لكل منها ميزات ذات قيمة عددية (مثل الطول والوزن والعمر وما إلى ذلك). إذا كان عدد الميزات هو ن يمكننا تمثيل العناصر كنقاط في ن -شبكة الأبعاد. بالنظر إلى عنصر جديد، يمكننا حساب المسافة من العنصر إلى كل عنصر آخر في المجموعة. نحن نختار ك أقرب الجيران ونرى أين يتم تصنيف معظم هؤلاء الجيران. ونصنف العنصر الجديد هناك.
لذلك تصبح المشكلة كيف يمكننا حساب المسافات بين العناصر. الحل لهذا يعتمد على مجموعة البيانات. إذا كانت القيم حقيقية فإننا عادة نستخدم المسافة الإقليدية. إذا كانت القيم فئوية أو ثنائية فإننا عادة نستخدم مسافة هامينغ.
الخوارزمية:

Given a new item: 1. Find distances between new item and all other items 2. Pick k shorter distances 3. Pick the most common class in these k distances 4. That class is where we will classify the new item

قراءة البيانات

البذور مقابل الجراثيم

دع ملف الإدخال الخاص بنا يكون بالتنسيق التالي:

Height Weight Age Class 1.70 65 20 Programmer 1.90 85 33 Builder 1.78 76 31 Builder 1.73 74 24 Programmer 1.81 75 35 Builder 1.73 70 75 Scientist 1.80 71 63 Scientist 1.75 69 25 Programmer

كل عنصر عبارة عن سطر وتحت "الفئة" نرى المكان الذي تم تصنيف العنصر فيه. القيم الموجودة أسفل أسماء الميزات ("الارتفاع" وما إلى ذلك) هي القيمة التي يحتوي عليها العنصر لتلك الميزة. يتم فصل كافة القيم والميزات بفواصل.
ضع ملفات البيانات هذه في دليل العمل data2 و بيانات . اختر واحدة والصق المحتويات كما هي في ملف نصي اسمه بيانات .
سوف نقرأ من الملف (المسمى "data.txt") وسنقوم بتقسيم الإدخال حسب الأسطر:

f = open('data.txt' 'r'); lines = f.read().splitlines(); f.close();

يحتوي السطر الأول من الملف على أسماء الميزات مع الكلمة الأساسية "فئة" في النهاية. نريد تخزين أسماء الميزات في قائمة:

# Split the first line by commas # remove the first element and # save the rest into a list. The # list now holds the feature # names of the data set. features = lines[0].split(' ')[:-1];

ثم ننتقل إلى مجموعة البيانات نفسها. سنقوم بحفظ العناصر في قائمة اسمها أغراض عناصرها عبارة عن قواميس (واحد لكل عنصر). مفاتيح قواميس العناصر هذه هي أسماء الميزات بالإضافة إلى "الفئة" للاحتفاظ بفئة العنصر. في النهاية نريد خلط العناصر الموجودة في القائمة (وهذا إجراء أمان في حالة وجود العناصر بترتيب غريب).

Python3

items = []; for i in range(1 len(lines)): line = lines[i].split(' '); itemFeatures = {'Class' : line[-1]}; # Iterate through the features for j in range(len(features)): # Get the feature at index j f = features[j]; # The first item in the line # is the class skip it v = float(line[j]); # Add feature to dict itemFeatures[f] = v; # Append temp dict to items items.append(itemFeatures); shuffle(items);

تصنيف البيانات

مع البيانات المخزنة في أغراض نبدأ الآن في بناء المصنف الخاص بنا. بالنسبة للمصنف سنقوم بإنشاء وظيفة جديدة تصنيف . سوف يستغرق كمدخل العنصر الذي نريد تصنيف قائمة العناصر و ك عدد أقرب الجيران.
لو ك أكبر من طول مجموعة البيانات، لذلك لا يمكننا المضي قدمًا في التصنيف لأنه لا يمكن أن يكون لدينا جيران أقرب من إجمالي عدد العناصر في مجموعة البيانات. (بدلاً من ذلك يمكننا تعيين k كـ أغراض الطول بدلاً من إرجاع رسالة خطأ)

if(k > len(Items)): # k is larger than list # length abort return 'k larger than list length';

نريد حساب المسافة بين العنصر المراد تصنيفه وجميع العناصر الموجودة في مجموعة التدريب في النهاية مع الحفاظ على ك أقصر المسافات. للاحتفاظ بأقرب الجيران الحاليين نستخدم قائمة تسمى الجيران . كل عنصر على الأقل يحمل قيمتين واحدة للمسافة من العنصر المراد تصنيفه وأخرى للفئة التي ينتمي إليها الجار. وسوف نقوم بحساب المسافة عن طريق الصيغة الإقليدية المعممة (ل ن أبعاد). ثم سنختار الفصل الذي يظهر فيه معظم الوقت الجيران وسيكون هذا اختيارنا. في الكود:

Python3

def Classify(nItem k Items): if(k > len(Items)): # k is larger than list # length abort return 'k larger than list length'; # Hold nearest neighbors. # First item is distance  # second class neighbors = []; for item in Items: # Find Euclidean Distance distance = EuclideanDistance(nItem item); # Update neighbors either adding # the current item in neighbors  # or not. neighbors = UpdateNeighbors(neighbors item distance k); # Count the number of each # class in neighbors count = CalculateNeighborsClass(neighbors k); # Find the max in count aka the # class with the most appearances. return FindMax(count);

الوظائف الخارجية التي نحتاج إلى تنفيذها هي المسافة الإقليدية تحديث الجيران حساب الجيران و FindMax .

إيجاد المسافة الإقليدية

الشمولية في جافا

الصيغة الإقليدية المعممة لمتجهين x و y هي كما يلي:

distance = sqrt{(x_{1}-y_{1})^2 + (x_{2}-y_{2})^2 + ... + (x_{n}-y_{n})^2}

في الكود:

Python3

def EuclideanDistance(x y): # The sum of the squared  # differences of the elements S = 0; for key in x.keys(): S += math.pow(x[key]-y[key] 2); # The square root of the sum return math.sqrt(S);

تحديث الجيران

لدينا قائمة جيراننا (والتي يجب أن يبلغ طولها على الأكثر ك ) ونريد إضافة عنصر إلى القائمة بمسافة معينة. أولا سوف نتحقق مما إذا كان الجيران يبلغ طولها ك . إذا كان به أقل نضيف إليه العنصر بغض النظر عن المسافة (حيث نحتاج إلى ملء القائمة حتى ك قبل أن نبدأ في رفض العناصر). إذا لم يكن الأمر كذلك، فسوف نتحقق مما إذا كان العنصر يحتوي على مسافة أقصر من العنصر الذي لديه أقصى مسافة في القائمة. إذا كان ذلك صحيحًا، فسوف نستبدل العنصر ذو المسافة القصوى بالعنصر الجديد.
للعثور على عنصر المسافة القصوى بسرعة أكبر، سنحتفظ بالقائمة مرتبة بترتيب تصاعدي. وبالتالي فإن العنصر الأخير في القائمة سيكون له أقصى مسافة. سنقوم باستبداله بعنصر جديد وسنقوم بفرزه مرة أخرى.
لتسريع هذه العملية يمكننا تنفيذ فرز الإدراج حيث نقوم بإدراج عناصر جديدة في القائمة دون الحاجة إلى فرز القائمة بأكملها. على الرغم من أن الكود الخاص بهذا طويل إلى حد ما وعلى الرغم من بساطته إلا أنه سيعرقل البرنامج التعليمي.

Python3

def UpdateNeighbors(neighbors item distance k): if(len(neighbors) > distance): # If yes replace the last # element with new item neighbors[-1] = [distance item['Class']]; neighbors = sorted(neighbors); return neighbors;

حساب الجيران

سنقوم هنا بحساب الفصل الذي يظهر في أغلب الأحيان الجيران . ولهذا سوف نستخدم قاموسًا آخر يسمى عدد حيث المفاتيح هي أسماء الفئات التي تظهر فيها الجيران . إذا لم يكن هناك مفتاح فسنضيفه وإلا فسنزيد قيمته.

Python3

def CalculateNeighborsClass(neighbors k): count = {}; for i in range(k): if(neighbors[i][1] not in count): # The class at the ith index # is not in the count dict. # Initialize it to 1. count[neighbors[i][1]] = 1; else: # Found another item of class  # c[i]. Increment its counter. count[neighbors[i][1]] += 1; return count;

FindMax

الأخوة كايلي جينر

سوف نقوم بإدخال القاموس في هذه الوظيفة عدد نحن نبني في حساب الجيران وسوف نقوم بإرجاع الحد الأقصى.

Python3

def FindMax(countList): # Hold the max maximum = -1; # Hold the classification classification = ''; for key in countList.keys(): if(countList[key] > maximum): maximum = countList[key]; classification = key; return classification maximum;

خاتمة

وبذلك تم الانتهاء من هذا البرنامج التعليمي لـ kNN.
يمكنك الآن تصنيف إعداد العناصر الجديدة ك كما تراه مناسبا. عادةً ما يتم استخدام رقم فردي لـ k ولكن هذا ليس ضروريًا. لتصنيف عنصر جديد، تحتاج إلى إنشاء قاموس يحتوي على مفاتيح وأسماء الميزات والقيم التي تميز العنصر. مثال على التصنيف:

newItem = {'Height' : 1.74 'Weight' : 67 'Age' : 22}; print Classify(newItem 3 items);

الكود الكامل للنهج أعلاه موضح أدناه: -

Python3

# Python Program to illustrate # KNN algorithm # For pow and sqrt import math from random import shuffle ###_Reading_### def ReadData(fileName): # Read the file splitting by lines f = open(fileName 'r') lines = f.read().splitlines() f.close() # Split the first line by commas  # remove the first element and save # the rest into a list. The list  # holds the feature names of the  # data set. features = lines[0].split(' ')[:-1] items = [] for i in range(1 len(lines)): line = lines[i].split(' ') itemFeatures = {'Class': line[-1]} for j in range(len(features)): # Get the feature at index j f = features[j] # Convert feature value to float v = float(line[j]) # Add feature value to dict itemFeatures[f] = v items.append(itemFeatures) shuffle(items) return items ###_Auxiliary Function_### def EuclideanDistance(x y): # The sum of the squared differences # of the elements S = 0 for key in x.keys(): S += math.pow(x[key] - y[key] 2) # The square root of the sum return math.sqrt(S) def CalculateNeighborsClass(neighbors k): count = {} for i in range(k): if neighbors[i][1] not in count: # The class at the ith index is # not in the count dict.  # Initialize it to 1. count[neighbors[i][1]] = 1 else: # Found another item of class  # c[i]. Increment its counter. count[neighbors[i][1]] += 1 return count def FindMax(Dict): # Find max in dictionary return  # max value and max index maximum = -1 classification = '' for key in Dict.keys(): if Dict[key] > maximum: maximum = Dict[key] classification = key return (classification maximum) ###_Core Functions_### def Classify(nItem k Items): # Hold nearest neighbours. First item # is distance second class neighbors = [] for item in Items: # Find Euclidean Distance distance = EuclideanDistance(nItem item) # Update neighbors either adding the # current item in neighbors or not. neighbors = UpdateNeighbors(neighbors item distance k) # Count the number of each class  # in neighbors count = CalculateNeighborsClass(neighbors k) # Find the max in count aka the # class with the most appearances return FindMax(count) def UpdateNeighbors(neighbors item distance k ): if len(neighbors) < k: # List is not full add  # new item and sort neighbors.append([distance item['Class']]) neighbors = sorted(neighbors) else: # List is full Check if new  # item should be entered if neighbors[-1][0] > distance: # If yes replace the  # last element with new item neighbors[-1] = [distance item['Class']] neighbors = sorted(neighbors) return neighbors ###_Evaluation Functions_### def K_FoldValidation(K k Items): if K > len(Items): return -1 # The number of correct classifications correct = 0 # The total number of classifications total = len(Items) * (K - 1) # The length of a fold l = int(len(Items) / K) for i in range(K): # Split data into training set # and test set trainingSet = Items[i * l:(i + 1) * l] testSet = Items[:i * l] + Items[(i + 1) * l:] for item in testSet: itemClass = item['Class'] itemFeatures = {} # Get feature values for key in item: if key != 'Class': # If key isn't 'Class' add  # it to itemFeatures itemFeatures[key] = item[key] # Categorize item based on # its feature values guess = Classify(itemFeatures k trainingSet)[0] if guess == itemClass: # Guessed correctly correct += 1 accuracy = correct / float(total) return accuracy def Evaluate(K k items iterations): # Run algorithm the number of # iterations pick average accuracy = 0 for i in range(iterations): shuffle(items) accuracy += K_FoldValidation(K k items) print accuracy / float(iterations) ###_Main_### def main(): items = ReadData('data.txt') Evaluate(5 5 items 100) if __name__ == '__main__': main()

الإخراج:

0.9375

يمكن أن يختلف الإخراج من آلة إلى أخرى. يتضمن الكود وظيفة Fold Validation ولكنها لا علاقة لها بالخوارزمية، فهي موجودة لحساب دقة الخوارزمية.

إنشاء اختبار