ABSTRACT submitted


BACKPROPAGATION ON A NOVEL NETWORK (NON-ANN) FOR HATE SPEECH DETECTION AND MORE



This paper introduces a novel type of network that can accurately classify "hate speech" and distinguish it from normal content. The network is built around a unique POS (part of speech) tagging style of system whose specifics are learnt by the network itself. Each word is assigned variables that indicate the polarity (sign) of the effect that word has on every other word in a sentence, the magnitude of that effect, and the extent to which the word filters itself from the effects of other words.

When choosing this set of properties, we should be clear about what we want the network to do. Posts classified as hate speech and posts that are not draw their words from a common pool (they share words), yet we still want the results for the two classes to polarize. On top of this, I believe the problem is a complex system: a small change to a post (such as the presence of the word "NOT" in an otherwise innocent statement) can completely reverse the polarity of the classification, while a large change may do nothing at all. The converse case is easy to imagine as a lengthy elaboration of either a well-meaning message or a malicious one, even when the word "NOT" appears in it, but in a different role. We would like to capture all of this information somehow, or the system will not converge.

Note that this is not an n-gram-based system. The variables an individual word carries are local to it, in the sense that they are not a direct mapping between words. We do not map every word to every other word with three mappings representing the three variables. That would only produce an explosion of variables, because to match the efficiency of the system presented here, such a scheme would effectively need to map the powerset of the set of all words in the language onto itself. The system described in this paper still captures all the detail in such a mapping, yet if we count the variables needed to represent it we only need 3 X [the number of words in the language], because we only keep an explicit mapping between each word and its own three variables. All of the other information contained in the powerset mapping, which is in some sense complete, is learnt implicitly by adjusting each word's variables with gradient descent optimization, designed to optimize the effect a word's three variables have on the other words (the words themselves, not their variables) in the sentence. It may sound as though I am introducing a mapping when I speak of the effect individual words have on each other in a sentence, but I am not, as the following will clarify.

I will call the variables "permeability", "stream", and "angle", and they function as follows (again, each word has its own permeability, its own stream, and its own angle). Every word emits a stream from its left side and its right side, which ends at the limits of the sentence, i.e. the two periods at either end. Each word's stream pours out at a value equal to its stream variable, so it sends exactly the same amount of stream to every other word in each and every sentence it appears in (and will do so in every sentence it may be found in in the future).
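
To make this concrete, here is a minimal sketch in Python of the per-word variables and of the stream a word receives in a sentence. The function names and the dictionary layout are my own; they are one way to arrange the description above, not a fixed specification.

    # A minimal sketch (assumed names and layout): every word owns three variables,
    # a stream, a permeability and an angle, and in any sentence it sends the same
    # amount of stream to every other word.
    import random

    def init_params(vocabulary):
        """Assign each word its own stream, permeability and angle (angle in [-1, 1])."""
        params = {}
        for word in vocabulary:
            params[word] = {
                "stream": random.uniform(0.0, 1.0),
                "permeability": random.uniform(0.0, 1.0),
                "angle": random.uniform(-1.0, 1.0),
            }
        return params

    def received_stream(i, sentence, params):
        """Net stream the i-th word receives: every other word's stream, never its own."""
        return sum(params[w]["stream"] for j, w in enumerate(sentence) if j != i)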
Note that the stream a word sends out is not counted as being received by it, so a word receives every other word's stream except its own. In the process of determining its own value in the sentence (the word's value, not the value of its variables), each word adds up the total amount of stream it has received from all the other words combined and scales it by its own permeability variable. The permeability therefore acts as a filter for each word, filtering out (or in) the amount of stream it receives. Since different words have different permeabilities, every word collects a different amount of stream and hence has a unique value in the sentence. Permeability is local to a word in the sense that the other words have no direct way of knowing what the permeabilities of the other words in the sentence are. Stream is not local to its word in the sense that it goes to every other word, but it IS local in the sense that it belongs to the word that possesses it as its variable.

Each word also has an angle, a value between -1 and 1. The angle travels along with the stream and has the same local/non-local profile: it multiplies a word's total received stream by either a negative or a positive number, and so sets that word's final computed polarity (i.e. whether its value is positive or negative), ultimately determining the sign of the value of the sentence. We then compute a value for each word by taking the polarized net stream it received, scaled by its permeability, and store the result in a temporary variable, temp_word. There are as many temp_word variables as there are words in the sentence, each indexed by the word whose value it holds. The values of every temp_word in the sentence are then summed and stored in a variable temp_sentence.

A word's variables do NOT contribute to its own value in this equation; only the variables of the other words in the sentence affect it, and this holds round-robin when calculating the value of every word in the sentence. This may seem counter-intuitive, but it guarantees that the same word appearing in two different sentences takes a different value in each, and that is how we differentiate largely similar sentences. Of course this is already true when the three variables of each word are initialized randomly, but such a system is hopeless for classification: the value of the final sentence (as calculated from the values of its words) is random, and you cannot tell from it whether the sentence contained hate speech or good speech, or differentiate anything else either. We fix this by training the system to adjust each word's variables so that they are no longer random but optimized to produce one value for hate speech and a diametrically opposite value for good speech, based on the way each word influences the value of the sentence. [Note that the system can actually be optimized for different tasks, such as question answering or even designing novel drugs according to a written specification.] So we run an iterative loop in which we adjust the variables of each word along the negative gradient of the cost function. This is surprisingly like an ANN, yet there are no weights, biases or even neurons; despite that, we can still backpropagate the error.
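
Continuing the sketch, each temp_word below is the stream a word receives from the others, signed by its own angle and scaled by its own permeability; the text leaves some room for interpretation about exactly whose angle and permeability apply, so treat this as one plausible reading. temp_sentence is the sum of the temp_word values, and the squared-error cost against a label of +1 or -1 is my assumption; the post only requires that the two classes be pushed toward diametrically opposite values.

    def word_value(i, sentence, params):
        """temp_word for the i-th word: received stream, signed by this word's angle
        and filtered by its permeability (one reading of the description above)."""
        this_word = params[sentence[i]]
        return this_word["permeability"] * this_word["angle"] * received_stream(i, sentence, params)

    def sentence_value(sentence, params):
        """temp_sentence: the sum of every word's temp_word value."""
        return sum(word_value(i, sentence, params) for i in range(len(sentence)))

    def cost(sentence, target, params):
        """Assumed squared-error cost against a target label of +1 or -1."""
        return (sentence_value(sentence, params) - target) ** 2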
To perform backpropagation we use the chain rule, with a chain that starts at the sentence value (temp_sentence), passes through a "lower layer" made of the temp_word values of the words in that sentence (each temp_word is a node in this layer), and ultimately reaches the three variables of each word, permeability, stream and angle, since changing those variables is what changes the temp values (a sketch of this backward pass is given below). The algorithm learns through backpropagation with gradient descent, adjusting the temp variables at each level, while forward propagation computes a value for each word in a sentence by considering the effects the other words have on it and then sums those values to obtain a value for the sentence. That value is compared with an expected value of 1 or -1: during training, when calculating the cost for the example, and during operation, for classification. Note that the network has a different topology of nodes for each training example, parametrized by the number of words in each sentence and the number of sentences in that particular example. Also, when multiplying by the learning rate we do not multiply the temp variables directly, but indirectly, by multiplying the values of the lowest layer of nodes, the layer made from the permeability, stream and angle of each word in the sentences forming the current topology.

Finally, the same idea can play other roles, e.g. image classification: each pixel can have three variables associated with it that emanate in all directions. On the second layer of nodes, segmentation occurs around those pixels with a value of 1, and different alignments of different segments occur when another ratio is met, forming yet another layer of nodes. Alternatively, this type of image segmentation could be the basis of new kinds of data types (the segments), the analogues of concepts, where a meaning space is partitioned. An initially meaningless Euclidean space would suffice, but as we segment it the system becomes complex (more meaningful): the ratios of distances between segments/concepts no longer sum in an intuitive or linear way, and the Euclidean meaning space becomes a kind of non-Euclidean manifold in which these concepts (segments of the space) are embedded, with each concept's context interacting with the other concepts' contexts in a complex manner.
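
Below is a sketch of one training step under the same assumptions as the earlier snippets: the chain rule is applied by hand from temp_sentence down through each temp_word to the three per-word variables, and every word's permeability, stream and angle are nudged along the negative gradient of the assumed squared-error cost. The learning rate, the clipping of the angle back into [-1, 1], and the toy usage at the end are illustrative choices of mine, not taken from the post.

    def train_step(sentence, target, params, lr=0.01):
        """One gradient-descent step on the three variables of every word in the sentence."""
        error = sentence_value(sentence, params) - target  # d(cost)/d(temp_sentence) = 2 * error
        grads = {w: {"stream": 0.0, "permeability": 0.0, "angle": 0.0} for w in sentence}
        for i, w_i in enumerate(sentence):
            received = received_stream(i, sentence, params)
            p_i = params[w_i]["permeability"]
            a_i = params[w_i]["angle"]
            # temp_word_i = p_i * a_i * received, so:
            grads[w_i]["permeability"] += 2 * error * a_i * received
            grads[w_i]["angle"] += 2 * error * p_i * received
            # word i's stream reaches every other word j, where it is signed and filtered:
            for j, w_j in enumerate(sentence):
                if j != i:
                    grads[w_i]["stream"] += (
                        2 * error * params[w_j]["permeability"] * params[w_j]["angle"]
                    )
        for w, g in grads.items():
            for name in ("stream", "permeability", "angle"):
                params[w][name] -= lr * g[name]
            # keep the angle inside [-1, 1], as required above
            params[w]["angle"] = max(-1.0, min(1.0, params[w]["angle"]))

    # Toy usage (sentences and labels are purely illustrative):
    params = init_params(["this", "is", "not", "fine", "great"])
    for _ in range(500):
        train_step(["this", "is", "not", "fine"], -1.0, params)
        train_step(["this", "is", "great"], +1.0, params)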


Comments

  1. If this is a convex optimisation problem, then we only have to know what the global minimum is to evaluate this algorithm's efficiency.

    I believe that this may say something about the convexity of the cost function.

    https://papers.nips.cc/paper/2800-convex-neural-networks.pdf

  2. As well as this

    https://arxiv.org/abs/1412.8690

  3. And

    https://link.springer.com/content/pdf/10.1007%2Fs11633-017-1054-2.pdf


