Data Scraping and an Analysis of Anti-Scraping

A lot has been written on this topic online. Today I went back through some old posts and found one in which the author describes his own history: first he scraped other people's sites, later he was scraped by others, and later still he worked out his own techniques to keep scrapers away. Reading it, I could not help sighing: piracy and anti-piracy, scraping and anti-scraping, plagiarism and anti-plagiarism, all of it is done for profit! Worse, I saw one extreme case of someone hosting more than 2,000 junk sites, all filled by scraping, to earn very little money. I cannot tell whether that person is a master or just rubbish! In fact, I do not oppose scraping; resources on the network are there to be shared. But scraping everything indiscriminately is a very low way to go about it!

First, how scraping works:

Scraping has two main steps:

First, fetch the content of the page to be scraped
Second, extract the data you need from the fetched source code
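
The two steps above can be sketched roughly as follows. This is a minimal illustration in Python (not the article's ASP), and it is not how any particular collector is implemented: the function names, the marker strings and the sample HTML are all assumptions made up for the example. Real collectors let the user configure the start and end markers per site.

```python
import re
import urllib.request


def fetch_page(url: str, timeout: float = 10.0) -> str:
    """Step 1: download the raw HTML of the page to be scraped."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


def extract_between(html: str, start_marker: str, end_marker: str) -> list[str]:
    """Step 2: pull out every fragment that sits between two fixed markers."""
    pattern = re.escape(start_marker) + r"(.*?)" + re.escape(end_marker)
    return [m.strip() for m in re.findall(pattern, html, flags=re.DOTALL)]


# Hypothetical page source, standing in for the result of fetch_page(url).
sample_html = """
<html><body>
<div class="post">First post body</div>
<div class="post">Second post body</div>
</body></html>
"""

posts = extract_between(sample_html, '<div class="post">', "</div>")
print(posts)  # ['First post body', 'Second post body']
```

The extraction step only works because the markers around the content are the same on every page, which is exactly the regularity the anti-scraping ideas later in this article try to remove.
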
The most popular scraper that works this way is the Locomotive collector, version 2.1. I tested that version too, and it works well. Its bundled example is about scraping posts from the "Out of Date" forum, and I found that forum to be very generous: although the Discuz software also has anti-scraping measures, "Out of Date" does not apply such restrictions, so its posts are easy to scrape. I have to admire Fish's business strategy! Of course, even if someone copied the whole forum, that would not produce a second "Out of Date".

Following the Locomotive collector's example, I also scraped several "Out of Date" posts as an experiment, and it succeeded without many twists and turns. In my view this collector is very powerful; with it, a junk site really can be filled with all kinds of content in no time! During the trial I also ran into problems with Fei's and Ying Zheng's sites. The main one is that some steps are restricted by cookie verification, so the real page is not returned and the full text cannot be read. Without the body text there is, of course, nothing to filter. Fei uses phpwind and Ying Zheng uses Discuz, so I assume the restriction comes from either the site itself or the forum software. I will look into it further when I have time; the leisure sections of those two sites have very good content!

Having covered how scrapers work in detail, let us turn to anti-scraping strategies.

There are many anti-scraping methods in use today. I will first go through the common ones, with their drawbacks and the way scrapers respond to them:

First, count how many pages each IP visits on the site within a certain period; if the rate is clearly higher than normal browsing, refuse that IP
Drawbacks:
1, This method only works for dynamic pages (asp, jsp, php and so on). Static pages run no server-side script, so they cannot count how many pages an IP has visited in a given period
2, This method seriously interferes with search engine indexing, because spiders crawl quickly and with multiple threads, so the method will refuse the spiders as well
Scraper response: the scraper can only slow down, or give up
Recommendation: build a library of search engine spider IPs and let only those IPs browse the site's content at high speed. Collecting such a library is not easy, though, because a search engine's spiders do not necessarily use one fixed IP address.
Comment: this method is fairly effective against scraping, but it affects search engine indexing.
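
A rough sketch of the first method together with the recommendation, written in Python rather than ASP. Everything here is an assumption for illustration: the class name, the limits, and the IP addresses (taken from documentation ranges). The allowlist matches exact IPs, which is a simplification, since spiders do not necessarily keep a fixed address.

```python
import time
from collections import defaultdict, deque


class IpRateLimiter:
    """Refuse an IP that requests more than max_hits pages per time window."""

    def __init__(self, max_hits, window_seconds, allowlist=()):
        self.max_hits = max_hits
        self.window = window_seconds
        self.allowlist = set(allowlist)   # known search engine spider IPs
        self.hits = defaultdict(deque)    # ip -> timestamps of recent requests

    def allow(self, ip, now=None):
        if ip in self.allowlist:
            return True                   # spiders are never throttled
        now = time.monotonic() if now is None else now
        recent = self.hits[ip]
        # Forget requests that have fallen out of the window.
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        if len(recent) >= self.max_hits:
            return False                  # too fast: refuse this request
        recent.append(now)
        return True


# Hypothetical limits: at most 3 pages per 10 seconds.
limiter = IpRateLimiter(max_hits=3, window_seconds=10, allowlist={"203.0.113.5"})
results = [limiter.allow("198.51.100.7", now=t) for t in (0, 1, 2, 3)]
print(results)  # [True, True, True, False]
```

The counting has to run on every request, which is why the approach fits dynamic pages and not plain static files.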

Second, encrypt the content of the page with javascript

Drawbacks: this method works for static pages, but it seriously affects search engine indexing, because what the search engine receives is also the encrypted content
Scraper response: it is best not to scrape such a site; if you must, scrape the JS script that decrypts the content as well.
Recommendation: I have no good improvement to suggest
Comment: webmasters who count on search engines for traffic are advised not to use this method.

Third, replace specific tags in the content page with "specific tag + hidden copyright text"

Drawbacks: this method has few drawbacks, as it only increases the page file size a little, but it is easy for scrapers to defeat
Scraper response: strip the hidden copyright text from the scraped content, or replace it with your own copyright notice.
Recommendation: I have no good improvement to suggest
Comment: I feel this has little practical value; even with randomized hidden text it is superfluous.

Fourth, allow browsing only after the user has logged in
Drawbacks: this method seriously affects indexing by search engine spiders
Scraper response: someone on "Out of Date" has already posted a countermeasure article; for the specifics, see "How ASP thief programs use XMLHTTP to submit forms and send cookies or sessions"
Recommendation: I have no good improvement to suggest
Comment: webmasters who count on search engines for traffic are advised not to use this method. It does, however, have some effect against general-purpose scraping programs.

Fifth, do the pagination with javascript or vbscript
Drawbacks: it affects search engine indexing
Scraper response: analyze the javascript or vbscript, find its paging rule, and write a matching paging rule for the scraper.
Recommendation: I have no good improvement to suggest
Comment: I feel that anyone who understands the scripting language can work out its paging rule

Sixth, allow a page to be viewed only when it is reached through a link on this site, for example by checking Request.ServerVariables("HTTP_REFERER")
Drawbacks: it affects search engine indexing
Scraper response: I do not know whether the referring page can be faked. At present I have no countermeasure for scraping a site that uses this method
Recommendation: I have no good improvement to suggest
Comment: webmasters who count on search engines for traffic are advised not to use this method. It does, however, have some effect against general-purpose scraping programs.
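
On the open question in the sixth method: the Referer is an ordinary HTTP request header filled in by the client, so a client can set it to any value it likes. The Python sketch below shows a server-side check of the kind the sixth method describes and a request object that carries a chosen Referer. The host name, URLs and function name are hypothetical, and no network request is actually sent.

```python
import urllib.request
from urllib.parse import urlparse


def is_same_site_referer(referer, site_host):
    """Server-side check: accept only requests whose Referer points at our own host."""
    if not referer:
        return False
    return urlparse(referer).hostname == site_host


# A client builds its own headers, so it can claim any referring page.
request = urllib.request.Request(
    "http://example.com/article/42.htm",
    headers={"Referer": "http://example.com/list.htm"},
)
forged = request.get_header("Referer")
print(is_same_site_referer(forged, "example.com"))  # True
```

This suggests the check mainly stops scrapers that never send a Referer at all, which matches the comment that it has some effect against general-purpose programs.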

From the above we can see that the commonly used anti-scraping methods either have a large impact on search engine indexing or are too weak to stop scrapers. So is there an effective anti-scraping method that does not affect search engine indexing? Please read on.

Below is my anti-scraping strategy: block scrapers without blocking search engines

From the scraping principles described earlier, we can see that the vast majority of scraping programs rely on rules, such as the naming rule of page files and the regular structure of the page code.

First, countermeasures against page file-name rules

Most scrapers rely on the naming rule of page files to scrape many pages in bulk. If other people cannot work out the rule behind your page file names, they cannot bulk-scrape your site.
Method:
I think naming page files with an MD5 hash is a good approach. At this point some people will say: if you hash the file names with MD5, others can apply the same rule and reproduce your page file names.

What I want to point out is that when hashing the file name, we must not hash only the part that changes.
If I stands for the page number, then we should not hash like this:
page_name = Md5(I, 16) & ".htm"

It is best to append one or more characters to the page number before hashing, for example: page_name = Md5(I & "any one or more letters", 16) & ".htm"

Because MD5 cannot be reversed, other people see only the MD5 result in the page name and cannot tell which letters you appended to I, unless they brute-force the MD5, which is unrealistic.
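
The naming scheme above might look like this in Python. It is a sketch under assumptions: the secret suffix is a made-up value, and `md5_16` assumes that the article's `Md5(x, 16)` means the middle 16 hex characters of the full digest, which is how 16-character MD5 helpers are commonly written but is not stated in the article.

```python
import hashlib

SECRET_SUFFIX = "qZx"  # hypothetical; the article only says "one or more letters"


def md5_16(text):
    """Middle 16 hex characters of the MD5 digest (assumed meaning of Md5(x, 16))."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()[8:24]


def page_name(page_number, suffix=SECRET_SUFFIX):
    """File name for a page: hash of the page number plus a secret suffix."""
    return md5_16(f"{page_number}{suffix}") + ".htm"


names = [page_name(i) for i in range(1, 4)]
print(names)
```

Someone who guesses that the names are MD5 values can hash 1, 2, 3 and so on, but the results will not match any file name unless the suffix is also known. A longer, random suffix makes guessing it correspondingly harder.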

Second, countermeasures against page-code rules

If the code of our content pages follows no rule, other people cannot extract the content they want from it.
So at this step, to prevent scraping, we must make the code irregular.
Method:
Randomize the tags the other side needs for extraction
1, Build several page templates, each using different HTML tags around the important content, and pick a template at random when rendering a page. Some pages use a CSS + DIV layout and some use a table layout. This is troublesome, because one content page needs several templates, but anti-scraping is tedious work in itself. Making a few extra templates helps defend against scraping, and for many people it is worth it.
2, If the method above seems too troublesome, randomizing just the important HTML tags in the page also works.
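
A small Python sketch of the random-template idea in point 1. The three templates and the function name are invented for illustration; a real site would use full page layouts, and the title and body are inserted as given, without escaping.

```python
import random

# Hypothetical templates: same content, different tags around it.
TEMPLATES = [
    '<div class="c1"><h1>{title}</h1><div class="body">{body}</div></div>',
    '<table><tr><td><h2>{title}</h2></td></tr><tr><td>{body}</td></tr></table>',
    '<section><p class="t">{title}</p><article>{body}</article></section>',
]


def render_page(title, body, rng=random):
    """Render one content page with a randomly chosen template."""
    template = rng.choice(TEMPLATES)
    return template.format(title=title, body=body)


print(render_page("Example title", "Example body text"))
```

A scraper rule that looks for the content between one fixed pair of tags will then match only the pages that happened to get that template.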

The more page templates you make and the more random the HTML code, the more trouble the other side has analyzing the content code, and the harder it is to write a scraping strategy specifically for your site. At that point most people will back off, because these people scrape other sites' data precisely because they are lazy. Besides, most people scrape with programs developed by others; those who develop their own scraping programs are, after all, a minority.

Here are a few more simple ideas:
1, Display content that matters to scrapers but does not matter to search engines with client-side script
2, Splitting one piece of data across N pages is another way to make scraping harder
3, Use deeper link levels. Most current scraping programs can only collect the first three levels of a site, so content that sits at a deeper level may avoid being scraped. This may, however, inconvenience visitors.
For example:
Most sites are structured as: Home -> content index page -> content page
If this is changed to:
Home -> content index page -> content entrance page -> content page
Note: the content entrance page should preferably include code that automatically redirects to the content page

<meta http-equiv="refresh" content="6;url=content page (http://www.jz123.cn)">
In fact, as long as the first measure (hashing the page file names) is done well, the anti-scraping result is already good. I still suggest using both measures together, to make scraping harder so that scrapers back off when they see the difficulty.

As the saying goes, when virtue rises one foot, vice rises ten; running a site really is not easy! That is why the more capable webmasters generally have strong coding skills. Pity the webmasters who have no technical skills and do all the work themselves: after all their hard work, their content is copied wholesale by someone else overnight. What a shame! So, to all the experts out there: please be civic-minded; there are still some basic principles to observe!

Source Chinese Address:http://www.linwan.net.cn/archives/2088.html
