ABSTRACT submitted
BACKPROPAGATION ON A NOVEL NETWORK (NON-ANN) FOR HATE SPEECH DETECTION
AND MORE
This paper introduces a novel type of network that can accurately classify
"hate speech" and differentiate it from normal content. It develops a
network centered around a unique POS (part of speech) tagging style of system
whose specifics are learnt by the network. Each word is assigned variables that
indicate the polarity (sign) of the effect that word has on every other word in
a sentence, the magnitude of that effect, and the extent to which the word
filters itself from the effects of other words. When choosing this set of
properties, we should first be clear about what we want the network to do. We
know that posts classified as hate speech and posts that are not draw their
words from a common pool (they share words), yet we still want the results for
opposite classes to polarize. On top of this, I believe the problem represents
a complex system: small changes to a post (such as the presence of the word
"NOT" in an otherwise innocent statement) can completely reverse the polarity
of the classification, while large changes can leave it unchanged. The latter
is easy to imagine as a lengthy elaboration of either a well-meaning message or
its opposite, even when the word "NOT" appears in it in a different role. We
would like to capture all of this information somehow, or else the system will
not converge.
Note: This is not a gram-based system. The variables an individual word has are
local to it, in the sense that they are not a direct mapping between words. It
is not the case that we have mapped every word to every other word with three
mappings representing the three variables. That would only result in an
explosion of variables, because to match the expressive power of the system I
am presenting, such a system of mappings would need to map the powerset of the
set of all words in a language onto itself, for every possible mapping in that
powerset. The system described in this paper still carries all the detail of
such a mapping, and yet if we count the variables needed to represent it, we
only need 3 × [the number of words in the language], because the only explicit
mapping is between each word and its own three variables. All the other
information contained in the powerset mapping, which is in some sense complete,
is learnt implicitly by adjusting each word's variables with gradient descent
optimization, designed to optimize the effect a word's three variables have on
the other words (the words themselves, not the words' variables) in the
sentence. This may sound as if I am introducing a mapping when I speak of the
effect individual words have on each other in a sentence, but I am not, and
will now clarify. I will call the variables "permeability", "stream" and
"angle". (Again, each word has its own permeability, its own stream and its own
angle.)
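Purely to fix ideas about storage and the parameter count, the following is a
minimal Python sketch of how the three per-word variables might be held and
randomly initialized before training. The class name, the value ranges for
permeability and stream, and the initialization scheme are my own illustration;
the text itself only fixes the count of 3 × [vocabulary size] values and the
[-1, 1] range of the angle.

import random
from dataclasses import dataclass

@dataclass
class WordParams:
    """The three variables every word carries with it."""
    permeability: float  # how strongly the word filters the stream it receives
    stream: float        # how much stream the word emits to every other word
    angle: float         # polarity factor in [-1, 1], carried along with the stream

def init_params(vocabulary):
    """One WordParams per word: exactly 3 * len(vocabulary) trainable values."""
    return {
        word: WordParams(
            permeability=random.uniform(-1.0, 1.0),  # illustrative range
            stream=random.uniform(-1.0, 1.0),        # illustrative range
            angle=random.uniform(-1.0, 1.0),         # the text constrains angle to [-1, 1]
        )
        for word in vocabulary
    }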
They function as follows. Every word emits a stream from both its left side and
its right side, which ends at the limits of the sentence, i.e. the two periods
at either end. Each word's stream pours out at a value equal to the stream
variable it possesses, so it sends the exact same amount of stream to all the
other words in every sentence it is in (and will do so in any other sentence it
may be found in in the future). Note that the stream a word sends out is not
counted as being received by itself, so a word receives every other word's
stream except its own. Then, in the process of determining its own value in the
sentence (the word's value, not the value of its variables), each word adds up
the total amount of stream it has received from all the other words combined
and scales it by its own permeability variable. The permeability thus acts as a
filter for each word, filtering out or in the amount of stream it gets. Since
different words have different permeabilities, every word collects a different
amount of stream and hence takes a unique value in the sentence. Every word has
a permeability associated with it; permeability is local to a word in the sense
that the other words have no direct way of knowing what the permeabilities of
the other words in the sentence are. Stream is not local to its word in the
sense that it goes to every other word except its own, but it IS local to its
word in the sense that it belongs to the word that possesses it as its
variable. Each word also has an angle. This is a value between -1 and 1. The
angle moves along with the stream and has the same local/non-local profile,
multiplying the stream a word receives by either a negative or a positive
number, and hence setting the polarity of the computed value for that
particular word (i.e. whether it will be positive or negative), ultimately
determining the sign of the value of the sentence.
We then compute a value for each word by taking the polarized net stream it
received, scaled by its permeability, and store the result in a temporary
variable, temp_word. There will be as many temp_word variables as there are
words in the sentence, each indexed by the word whose value it holds. The
values of all the temp_word variables in the sentence are then summed and
stored in a variable temp_sentence. A word's own stream and angle do NOT
contribute to its personal value in this equation; only the variables of the
rest of the words in the sentence feed into it (its own permeability merely
filters what arrives). This holds, round robin, for calculating the value of
each word in the sentence. It may seem counter-intuitive, but it means the same
word appearing in two different sentences will take a different value in each,
and that is how we go about differentiating largely similar sentences.
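As a concrete illustration, here is a minimal Python sketch of the forward pass
just described, using the WordParams sketch above. It follows one plausible
reading of the text: each word's angle travels with the stream it emits, so the
contribution a word receives from a sender is the sender's stream multiplied by
the sender's angle, and the receiving word then filters the polarized total
through its own permeability. The function name and the handling of repeated
words are my own choices.

def forward_sentence(words, params):
    """Compute temp_word for every word position and temp_sentence for the sentence.

    words  : list of tokens making up the sentence (between its two delimiting periods)
    params : dict mapping each token to its WordParams (permeability, stream, angle)
    """
    temp_word = []
    for j, receiver in enumerate(words):
        # Sum the polarized stream received from every OTHER position in the sentence;
        # a word never counts the stream it emits itself.
        received = sum(
            params[sender].stream * params[sender].angle
            for i, sender in enumerate(words)
            if i != j
        )
        # The receiver filters the total through its own permeability.
        temp_word.append(params[receiver].permeability * received)

    # The sentence value is the sum of the per-word values.
    temp_sentence = sum(temp_word)
    return temp_word, temp_sentence

In this sketch repeated words share a single set of parameters, which is what
keeps the total parameter count at 3 × the vocabulary size.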
Of course, this will already be true when we initialize the three variables of
each word randomly; however, such a system is hopeless for classification,
because the value of the final sentence (as calculated from the values of its
words) will follow a random pattern, and you would not be able to tell from it
whether the sentence contained hate speech or good speech, nor differentiate
anything else from its value either. We fix this by training the system to
adjust the values of each word's variables so that they are no longer random
but optimized to generate one sentence value for hate speech and a
diametrically opposite value for good speech, based on the way every other word
in the sentence influences the value of the sentence. [Note that we can
actually optimize the system for different tasks, such as question answering or
even designing novel drugs according to a written specification.]
So, we will have an iterative loop in which we adjust the variables of each
word along the negative gradient of the cost function. This is surprisingly
like an ANN, yet there are no weights, biases or even neurons; despite that, we
can still backpropagate the error. To perform backpropagation we use the chain
rule, with the chain starting from the sentence's value (temp_sentence), then
passing through a "lower layer" made of the temp values of the words in that
sentence (each temp_word in the sentence is a "node" in this layer), and
ultimately these temp values change through changes to each word's three
variables, permeability, stream and angle. The algorithm implements
backpropagation with gradient descent to learn, adjusting the values that feed
the temp variables at each level, while forward propagation involves computing
a value for each word in a sentence by considering the effects the other words
have on it, then summing those values to get a value for the sentence. That
sentence value is compared with an expected value of 1 or -1: during training,
when calculating the cost function for the example, and during operation, for
classification. Note that the network will have a different topology of nodes
for each training example, parametrized by the number of words in each sentence
and the number of sentences in that particular example. Also, when multiplying
by the learning rate, we do not update the temp variables directly but
indirectly, by updating the lowest layer of nodes, made from the permeability,
stream and angle of each word in the current sentences that make up the current
topology.
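For completeness, here is a hedged sketch of one training step that matches the
forward_sentence sketch above. The cost is taken as the squared difference
between temp_sentence and a target of +1 or -1, and the gradients for each
word's permeability, stream and angle follow from the chain rule through
temp_sentence and the temp_word nodes. The squared-error cost, the
learning-rate value and the restriction to a single sentence per example are my
own simplifications rather than requirements stated here.

def train_step(words, params, target, learning_rate=0.01):
    """One gradient-descent update of the three variables of every word in the sentence.

    target is the expected sentence value for this training example: +1 or -1.
    """
    temp_word, temp_sentence = forward_sentence(words, params)

    # Squared-error cost for this example: cost = (temp_sentence - target) ** 2.
    # Its gradient with respect to temp_sentence:
    g = 2.0 * (temp_sentence - target)
    # temp_sentence is a plain sum, so d(temp_sentence)/d(temp_word[j]) = 1 for every j.

    # Accumulate gradients first, then apply them, so every partial derivative
    # is evaluated at the same (pre-update) parameter values.
    grads = {w: [0.0, 0.0, 0.0] for w in set(words)}  # [permeability, stream, angle]

    for j, receiver in enumerate(words):
        # d(temp_word[j]) / d(permeability[receiver]) = polarized stream received by j.
        received = sum(
            params[s].stream * params[s].angle
            for i, s in enumerate(words) if i != j
        )
        grads[receiver][0] += g * received

        for i, sender in enumerate(words):
            if i == j:
                continue
            # d(temp_word[j]) / d(stream[sender]) = permeability[receiver] * angle[sender]
            grads[sender][1] += g * params[receiver].permeability * params[sender].angle
            # d(temp_word[j]) / d(angle[sender])  = permeability[receiver] * stream[sender]
            grads[sender][2] += g * params[receiver].permeability * params[sender].stream

    # Move each word's variables along the negative gradient of the cost.
    for word, (d_perm, d_stream, d_angle) in grads.items():
        params[word].permeability -= learning_rate * d_perm
        params[word].stream -= learning_rate * d_stream
        params[word].angle -= learning_rate * d_angle

    return temp_sentence

A training run would simply loop train_step over labelled examples, and at
classification time the sign of temp_sentence would be compared against the two
class targets.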
Finally, in other roles, e.g. image classification, each pixel can have three
variables associated with it that emanate in all directions. On the second
layer of nodes, segmentation occurs around those pixels with a value of 1, and
different alignments of different segments occur when another ratio is met,
making yet another layer of nodes. Alternatively, the type of image
segmentation described here could be the basis of new kinds of data types (the
segments), the analogues of concepts, in which a meaning space is partitioned:
an initially (meaningless) Euclidean space would suffice, but as we segment it
the system becomes complex (more meaningful), where the ratios of distances
between segments/concepts do not sum up in an intuitive or linear way, and the
Euclidean meaning space becomes a kind of non-Euclidean manifold in which these
concepts (segments of the space) are embedded, and each concept's context
interacts with the other concepts' contexts in a complex manner.
If this is a convex optimisation problem, then we only have to know what the global minimum is to evaluate this algorithm's efficiency.
I believe the following may indicate something about the convexity of the cost function:
https://papers.nips.cc/paper/2800-convex-neural-networks.pdf
as well as this
https://arxiv.org/abs/1412.8690
and
https://link.springer.com/content/pdf/10.1007%2Fs11633-017-1054-2.pdf