Many people had the experience of checking their papers when they were in school. Generally, you can submit your thesis through platforms, and the platform will match the thesis with other theories for checking and finally get a similarity. I don’t know if you have thought about the checking problem. How is it achieved behind? Is it just a matter of finding similar strings to match? This article at algo.monster on articlelab.com will take you through a classical algorithm that can solve this problem: the longest common subsequence.
This article mainly contains the following three aspects.
- what is the longest common subsequence?
- how to implement the LCS calculation
- the use of LCS scenarios
1. What is the longest common subsequence?
The longest common subsequence is typical, and it belongs to a dynamic programming algorithm. Its primary purpose is to solve the longest subsequence of two sequences. Many people often confuse with the longest common subsequence problem. We will introduce the difference between the two by an example later.
Suppose there is a string “ABCBDAB,” and the subsequences of the series can be connected as follows
– ABC
– ABCBD
– BCBD
– CBDA
or they can be unconnected but in the following order
– BBA
– CDAB
– BCAB
– ACDB
So the critical thing here is the order of substrings, and it does not matter whether they are or not.
If the algorithm is for the case of concatenated strings, it is solving for the longest common subsequence.
So the longest common subsequence is defined as the longest common subsequence of a sequence S, if it is a subsequence of two or more known sequences, respectively, and is the longest of all lines that meet this condition. That is, to find the “similarity” between these sequences.
Definition extension
The longest common subsequence is a problem to find the longest subsequence of all sequences in a set of lines (usually two sequences). It is different from the problem of finding the longest common subsequence. The subsequences do not need to occupy consecutive positions in the original sequence. And the longest common substring (which requires contiguity) and the longest common subsequence are different.
Also, in computer science, the longest incremental subsequence is the one that finds a subsequence. Make the elements in this subsequence increase in importance sequentially, and the length of this subsequence is as huge as possible. The details in the longest increasing subsequence are not necessarily consecutive in the original sequence. Many mathematics, algorithms, random matrix theory, and representation theory involve the longest increasing subsequence.
Complexity
When the number of sequences is sure for general LCS problems (i.e., an arbitrary number of sequences), the problem can be solved in polynomial time using Dynamic Programming (DP).
There is an optimal substructure for the longest common subsequence problem: break the problem into smaller, simpler “subproblems,” which can be divided into more subproblems to simplify the whole situation. The solutions of the subproblems of the longest common subsequence problem are reusable, i.e., higher-level subproblems usually reuse the solutions of lower-level subproblems. Problems with these two properties can be solved using dynamic programming algorithms to store the keys of the problem without having to be computed repeatedly. This process requires keeping the keys of subproblems of the same level in a table to use the answers by higher-level subproblems.
2. How to compute the longest common subsequence problem?
If we solve this problem by brute force, how much computation is needed? Let’s first look at the number of subsequences.
The string’s length is 1, and there is only one subsequence, the string itself.
– The length of the string is 2. For example, in the string AB, the subsequence contains A, B, AB three.
– String length is 3. For example, the string ABC, the subsequence contains A, B, C, AB, BC, AC, ABC seven.
Conclusion
When the length is N, the number of subsequences is 2 to the Nth power-1, and as the length increases, the length of its subsequences will show an exponential rise. For a large amount of data, this is a very unpleasant way of calculation.
In the section on the definition of LCS, we mentioned that LCS belongs to the dynamic programming algorithm problem, and two typical characteristics of dynamic programming are.
- the optimal solution of the problem contains the optimal solutions of its subproblems.
- the subproblem space is finite and does not generate infinite subproblems, and there are subsequent computational overlaps.
It does not matter if you have not been exposed to dynamic programming before. We will introduce it slowly with examples later.
Suppose we have two sequences that need to solve the LCS, {x1,x2…. .x(m)} sequence and {y1,y2,… To solve the LCS of the current two sequences, we can split them into subproblems of shorter sequences, and the subproblems can continue to split until the sequence is empty.
Here we derive the flow of computing the subproblem.
Suppose the last two values of the two sequences x(m)=y(n). The LCS of the sequence is the original smaller subsequence {x1,x2 …. x(m-1)} and {y1,y2, y3… y(n-1)} of the sequence of LCS plus that equal value because the last values are the same. The order is all at the end, so this value must be part of the original LCS, and there exists x(m)=y(n)=z(k1).
As an example, for the sequence “ABCD” and “ACD,” the LCS of both sequences is “ACD,” and the tail values are the same. If both are removed, the LCS is also reduced by one value, so the length of the current sequence LCS is LCS[i, j ] = LCS(i-1, j-1) + 1
If we know the expression of the sequence, then we can do the calculation and trace the result by some matrix in solving.
Taking an example
Let’s introduce how to solve the above expression by an intermediate matrix through a slightly more complicated sequence solving process. Suppose we have a sequence A = “ABCBDAB” and a sequence B = “BDCABA”. We first initialize a two-dimensional matrix, processed according to the following rules.
- the row height of the two-dimensional matrix is M+1, where M is the length of the sequence A. The column height is N+1, where N is the length of the sequence B. 2.
- if xi == yj , update the current value to the diagonal value + 1
- if matrix xi! = yj then update the current value to the maximum of the value above and the left and fill a directional arrow pointing to the maximum. However, if the left value is the same as the value above, it points to the value above (this condition contains the equals relationship)
3. What scenarios and when is the longest common subsequence?
(1) DNA sequence itself can be seen as a string composed of four characters “A, C, G, T,” you can use the LCS algorithm to do similarity matching for two DNA sequences, find the LCS of two DNA sequences, you can know the difference between them.
(2) Similarity check for two papers to check the similarity of two articles and find out if there is any suspicion of plagiarism.
(3) Git-diff, a code matching algorithm in the Git system. With the above introduction, perhaps you have some idea of solving the LCS and implementing a thesis check.