English 中文(简体)
粉碎机——需要我的算法帮助
原标题:Python Crawler - need help with my algorithm

** 《经济及社会理事会正式记录,2008年,补编第5号》(E/2008/44)。

我写了一部拖网渔船,该拖网 and和 par子。

在第一版中,为了获得下一个有效页,I正在增加URL ID,并将无有效身份识别资料保存到档案中,在我注意到大多数有效身份识别器有返回后,有效的URLs被移至符合我需要的内容的传译员。

我做了一些统计,并读到该次复述清单——[8,18,7,17,6,16,5,15]。

见https://stackoverflow.com/questions/6809402/python-maximum-recursion-exceed-while-questing-a-python-object。 改为:

def checkNextID(ID):
    numOfRuns = 0
    global curRes, lastResult
    while ID < lastResult:
        try:
            numOfRuns += 1
            if numOfRuns % 10 == 0:
                time.sleep(7) # sleep every 10 iterations
                numOfRuns = 0
            if isValid(ID + 8): 
                parseHTML(curRes, ID)
                ID = ID + 8
            elif isValid(ID + 18):
                parseHTML(curRes, ID)
                ID = ID + 18
            elif isValid(ID + 7):
                parseHTML(curRes, ID)
                ID = ID + 7
            elif isValid(ID + 17):
                parseHTML(curRes, ID)
                ID = ID + 17
            elif isValid(ID+6):
                parseHTML(curRes, ID)
                ID = ID + 6
            elif isValid(ID + 16):
                parseHTML(curRes, ID)
                ID = ID + 16
            elif isValid(ID + 5):
                parseHTML(curRes, ID)
                ID = ID + 5
            elif isValid(ID + 15):
                parseHTML(curRes, ID)
                ID = ID + 15
            else:
                if isValid(ID + 1):
                    parseHTML(curRes, ID)
                ID = ID + 1
        except Exception, e:
            print "something went wrong: " + str(e) 

<>strong>isValid()是一种功能,可取得一个识别+一个子背书和返回。 如果该圆顶包含我所需要的内容,并且把该圆顶的垫子拯救到一个叫做“曲线”的全球变数上,那么如果该圆顶的斜线含有我所需要的数据,并且拯救了该圆形的坏片。

<>strong>parse/2006/2/strong>是一种功能,可获取大物体(curRes),按我需要的数据进行分类,然后将数据保存到一栏,然后返回。

在一个完美的世界上,一部法典将成为我所需要的一切工具,以掌握所有有效的身份证(在5M系列中大约有400K),它给我带来了更小的时间(x50更快)。

英国广播公司在缩小任何有效URL的幅度时,其代码非常低效,因此,I m 拖网在每艘航程中多半是相同的URLs,这是因为,我把该身份证增加一个以便保持进展,直到我找到下一个有效的URL,然后是Im 核对ID+8, 即18, 17 等,这有时使我在以前的电离中检查同样的URLs I。

so I went and changed the code so it will keep a a set of non-valid URLs which I ll avoid checking again and I can t get it to work, I m breaking my head for a few hours now, it s not working like it should.

这是我的新职能:

def checkNextID(ID):
    runs = 0
    badTries = set()
    diff = [8,18,7,17,6,16,5,15,1]
    global curRes, lastResult
    while ID < lastResult:
        if len(badTries) == 100: #every 100 non-valid IDs I can reset badTries(I think), if I ve increased the ID 100 times it means I won t be checking previous IDs and I can start saving the non-valid IDs again.
            badTries = set()
        try:
            for i in diff:
                tryID = ID + i #tryID will be the ID + one of the subtrahends.
                if tryID not in badTries: #if we haven t already tried that ID
                    if runs % 10 == 0:
                        time.sleep(6)  #sleep every 10 requests
                    if isValid(tryID):
                        runs += 1
                        parseHTML(curRes, ID)
                        ID = tryID
                        badTries = set() #I can reset the badTries now, I m moving forward.
                        break #will move to the next id.
                    else:
                        badTries.add(ID) #save the non-valid ID to the set
                        if i == 1: #it s the last iteration of the for loop, if the subtrahend is 1 and it s not valid I must increase the ID by 1 in order to go forward.
                            ID += 1
                else:
                    ID += 1 #if the ID is not a valid ID and I ve already checked it I must increase the ID by one or I ll get into an infinite loop.
        except Exception, e:
            print "something went wrong: " + str(e) +   ID -   + str(ID)

Im 为每套无有效身份识别器,在每次要求Valid()之前 如果我已经尝试过这个身份证,那么,如果Im 电话是Valid(Valid),否则,ID就会增加一个。

that bad bad file

513025328
513025317
513025327
513025316
513025326
513025312
513025320
513025330
513025319
513025329
513025318
513025328
513025317
513025327
513025313
513025321
513025331
513025320
513025330
513025319
513025329
513025318
513025328
513025314
513025322
513025332
513025321
513025331
513025320
513025330
513025319
513025329
513025315
513025323
513025333
513025322
513025332
513025321
513025331
513025320
513025330
513025316
513025324
513025334
513025323
513025333
513025322
513025332
513025321
513025331
513025317
513025325
513025335
513025324
513025334
513025323
513025333
513025322
513025332
513025318
513025326
513025336
513025325
513025335
513025324
513025334
513025323
513025333
513025319
513025327
513025337
513025326
513025336
513025325
513025335
513025324
513025334
513025320
513025328
513025338
513025327
513025337
513025326
513025336
513025325
513025335
513025321
513025329
513025339
513025328
513025338
513025327
513025337
513025326
513025336
513025322
513025330
513025340
513025329
513025339
513025328
513025338
513025327
513025337
513025323
513025331
513025341
513025330
513025340
513025329
513025339
513025328
513025338
513025324
513025332
513025342
513025331
513025341
513025330
513025340
513025329
513025339
513025325
513025333
513025343
513025332
513025342
513025331
513025341
513025330
513025340
513025326
513025334
513025344
513025333
513025343
513025332
513025342
513025331
513025341
513025327
513025335
513025345
513025334
513025344
513025333
513025343
513025332
513025342
513025328
513025336
513025346
513025335
513025345
513025334
513025344
513025333
513025343
513025329
513025337
513025347
513025336
513025346
513025335
513025345
513025334
513025344
513025330
513025338
513025348
513025337
513025347
513025336
513025346
513025335
513025345
513025331
513025339
513025349
513025338
513025348
513025337
513025347
513025336
513025346
513025332
513025340
513025350
513025339
513025349
513025338
513025348
513025337
513025347
513025333
513025341
513025351
513025340
513025350
513025339
513025349
513025338
513025348
513025334
513025342
513025352
513025341
513025351
513025340
513025350
513025339
513025349
513025335
513025343
513025353
513025342
513025352
513025341
513025351
513025340
513025350
513025336
513025344
513025354
513025343
513025353
513025342
513025352
513025341
513025351
513025337
513025345
513025355
513025344
513025354
513025343
513025353
513025342
513025352
513025338
513025346
513025356
513025345
513025355
513025344
513025354
513025343
513025353
513025339
513025347
513025357
513025346
513025356
513025345
513025355
513025344
513025354
513025340
513025348
513025358
513025347
513025357
513025346
513025356
513025345
513025355
513025341
513025349
513025359
513025348
513025358
513025347
513025357
513025346
513025356
513025342
513025350
513025360
513025349
513025359
513025348
513025358
513025347
513025357
513025343
513025351
513025361
513025350
513025360
513025349
513025359
513025348
513025358
513025344
513025352
513025362
513025351
513025361
513025350
513025360
513025349
513025359
513025345
513025353
513025363
513025352
513025362
513025351
513025361
513025350
513025360
513025346
513025354
513025364
513025353
513025363
513025352
513025362
513025351
513025361
513025347
513025355
513025365
513025354
513025364
513025353
513025363
513025352
513025362
513025348
513025356
513025366
513025355
513025365
513025354
513025364
513025353
513025363
513025349
513025357
513025367
513025356
513025366
513025355
513025365
513025354
513025364
513025350
513025358
513025368
513025357
513025367
513025356
513025366
513025355
513025365
513025351
513025359
513025369
513025358
513025368
513025357
513025367
513025356
513025366
513025352
513025360
513025370
513025359
513025369
513025358
513025368
513025357
513025367
513025353
513025361
513025371
513025360
513025370
513025359
513025369
513025358
513025368
513025354
513025362
513025372
513025361
513025371
513025360
513025370
513025359
513025369
513025355
513025363
513025373
513025362
513025372
513025361
513025371
513025360
513025370
513025356
513025364
513025374
513025363
513025373
513025362
513025372
513025361
513025371
513025357
513025365
513025375
513025364
513025374
513025363
513025373
513025362
513025372
513025358
513025366
513025376
513025365
513025375
513025364
513025374
513025363
513025373
513025359
513025367
513025377
513025366
513025376
513025365
513025375
513025364
513025374
513025360
513025368
513025378
513025367
513025377
513025366
513025376
513025365
513025375
513025361
513025369
513025379
513025368
513025378
513025367
513025377
513025366
513025376
513025362
513025370
513025380
513025369
513025379
513025368
513025378
513025367
513025377
513025363
513025371
513025381
513025370
513025380
513025369
513025379
513025368
513025378
513025364
513025372
513025382
513025371
513025381
513025370
513025380
513025369
513025379
513025365
513025373
513025383
513025372
513025382
513025371
513025381
513025370
513025380
513025366
513025374
513025384
513025373
513025383
513025372
513025382
513025371
513025381
513025367
513025375
513025385
513025374
513025384
513025373
513025383
513025372
513025382
513025368
513025376
513025386
513025375
513025385
513025374
513025384
513025373
513025383
513025369
513025377
513025387
513025376
513025386
513025375
513025385
513025374
513025384
513025370
513025378
513025388
513025377
513025387
513025376
513025386
513025375
513025385
513025371
513025379
513025389
513025378
513025388
513025377
513025387
513025376
513025386
513025372
513025380
513025390
513025379
513025389
513025378
513025388
513025377
513025387
513025373
513025381
513025391
513025380
513025390
513025379

你可以看到,我不工作,我知道整个设计有缺陷,但我无法找到,我确实会感谢你的帮助。

a summary of the problem -

I have a diff list [8,18,7,17,6,16,5,15] the program starts with getting an id, each time I need to check the next id which is - id + diff[i] (i=0) if (id + diff[i]) isn t a valid id I m checking the next id which is (id + diff[i+1]).

如果该电离有效(id + diff[i.n]) I m id by 1, andeck if id+1 is valid id, if not I m eck re with id + diff[i.n], until I ll found the next valid id.

in each iteration I m checking ids I ve already checked in the previous iteration (which costs a lot of time and is inefficient), I need to avoid checking the IDs I ve already checked and keep increasing the ID until I ll find the next valid ID.

id= 1(且有效)和 d=[8,18,7,17,6,16,5,15]。

first iteration will look like (I m marking with bold the id s which I could avoid checking) - first - id = 1

9、19、8、18、7、17、6、16、2

二、背 景

www.un.org/Depts/DGACM/index_chinese.htm

第三类:id=3

http://www.un.org/Depts/DGACM/index_chinese.htm

第四编:id=4

<>12><22>>>>-BINGO, 下一个有效地址是22!

这给我提出了29项要求,而不是17项,而我只是一个小例子,其范围从最后有效补贴的300-600英亩。

我可以拿手法,避免以聪明和有效的方式检查以前检查过的婴儿。

感谢!

最佳回答

我认为,我已经这样做。

第一,该守则基于你的想法:

import time

lastResult = 100

def checkNextID(ID, lastResult = lastResult, diff = [8,18,7,17,6,16,5,15]):
    runs = 0
    SEEN = set()

    while True:
        if ID>lastResult:
            print ( 
========================== 
                    
ID==%s 
                    
   ID>lastResult is %s : program STOPS )
                  % (ID,ID>lastResult,)
            break
        runs += 1
        if runs % 10 == 0:  time。sleep(0。5)
        if ID in SEEN:
            print  -----------------
ID== +str(ID)+   already seen, not examined 
            ID += 1
        else:
            curRes = isValid(ID)
            if curRes:
                print  --------------------------
ID== +str(ID)+   vaaaalid 
                while True:
                    for i in diff:
                        runs += 1
                        if runs % 10 == 0:  time。sleep(0。5)
                        curRes = isValid(ID+i)
                        SEEN。add(ID+i)
                        if curRes:
                            print     i==%2s   ID+i==%s   valid  % (i,ID+i)
                            ID += i
                            print  --------------------------
ID==%s  % str(ID)
                            break
                        else:
                            print     i==%2s   ID+i==%s   not valid  % (i,ID+i)
                    else:
                        ID += 1
                        break
            else:
                print  --------------------------
ID==%s  not valid  % ID
                ID += 1


def isValid(ID, valid_ones = (1,9,17,25,50,52,60,83,97,98)):
    return ID in valid_ones


checkNextID(0)

结果

--------------------------
ID==0  not valid
--------------------------
ID==1  vaaaalid
   i== 8   ID+i==9   valid
--------------------------
ID==9
   i== 8   ID+i==17   valid
--------------------------
ID==17
   i== 8   ID+i==25   valid
--------------------------
ID==25
   i== 8   ID+i==33   not valid
   i==18   ID+i==43   not valid
   i== 7   ID+i==32   not valid
   i==17   ID+i==42   not valid
   i== 6   ID+i==31   not valid
   i==16   ID+i==41   not valid
   i== 5   ID+i==30   not valid
   i==15   ID+i==40   not valid
--------------------------
ID==26  not valid
--------------------------
ID==27  not valid
--------------------------
ID==28  not valid
--------------------------
ID==29  not valid
-----------------
ID==30  already seen, not examined
-----------------
ID==31  already seen, not examined
-----------------
ID==32  already seen, not examined
-----------------
ID==33  already seen, not examined
--------------------------
ID==34  not valid
--------------------------
ID==35  not valid
--------------------------
ID==36  not valid
--------------------------
ID==37  not valid
--------------------------
ID==38  not valid
--------------------------
ID==39  not valid
-----------------
ID==40  already seen, not examined
-----------------
ID==41  already seen, not examined
-----------------
ID==42  already seen, not examined
-----------------
ID==43  already seen, not examined
--------------------------
ID==44  not valid
--------------------------
ID==45  not valid
--------------------------
ID==46  not valid
--------------------------
ID==47  not valid
--------------------------
ID==48  not valid
--------------------------
ID==49  not valid
--------------------------
ID==50  vaaaalid
   i== 8   ID+i==58   not valid
   i==18   ID+i==68   not valid
   i== 7   ID+i==57   not valid
   i==17   ID+i==67   not valid
   i== 6   ID+i==56   not valid
   i==16   ID+i==66   not valid
   i== 5   ID+i==55   not valid
   i==15   ID+i==65   not valid
--------------------------
ID==51  not valid
--------------------------
ID==52  vaaaalid
   i== 8   ID+i==60   valid
--------------------------
ID==60
   i== 8   ID+i==68   not valid
   i==18   ID+i==78   not valid
   i== 7   ID+i==67   not valid
   i==17   ID+i==77   not valid
   i== 6   ID+i==66   not valid
   i==16   ID+i==76   not valid
   i== 5   ID+i==65   not valid
   i==15   ID+i==75   not valid
--------------------------
ID==61  not valid
--------------------------
ID==62  not valid
--------------------------
ID==63  not valid
--------------------------
ID==64  not valid
-----------------
ID==65  already seen, not examined
-----------------
ID==66  already seen, not examined
-----------------
ID==67  already seen, not examined
-----------------
ID==68  already seen, not examined
--------------------------
ID==69  not valid
--------------------------
ID==70  not valid
--------------------------
ID==71  not valid
--------------------------
ID==72  not valid
--------------------------
ID==73  not valid
--------------------------
ID==74  not valid
-----------------
ID==75  already seen, not examined
-----------------
ID==76  already seen, not examined
-----------------
ID==77  already seen, not examined
-----------------
ID==78  already seen, not examined
--------------------------
ID==79  not valid
--------------------------
ID==80  not valid
--------------------------
ID==81  not valid
--------------------------
ID==82  not valid
--------------------------
ID==83  vaaaalid
   i== 8   ID+i==91   not valid
   i==18   ID+i==101   not valid
   i== 7   ID+i==90   not valid
   i==17   ID+i==100   not valid
   i== 6   ID+i==89   not valid
   i==16   ID+i==99   not valid
   i== 5   ID+i==88   not valid
   i==15   ID+i==98   valid
--------------------------
ID==98
   i== 8   ID+i==106   not valid
   i==18   ID+i==116   not valid
   i== 7   ID+i==105   not valid
   i==17   ID+i==115   not valid
   i== 6   ID+i==104   not valid
   i==16   ID+i==114   not valid
   i== 5   ID+i==103   not valid
   i==15   ID+i==113   not valid
-----------------
ID==99  already seen, not examined
-----------------
ID==100  already seen, not examined

==========================
ID==101
   ID>lastResult is True : program STOPS

在这里,根据我的想法制定法典:

import time

lastResult = 100

def checkNextID(ID, lastResult = lastResult, diff = [8,18,7,17,6,16,5,15]):
    runs = 0
    maxdiff = max(diff)
    others = [x for x in xrange(1,maxdiff) if x not in diff]
    lastothers = others[-1]
    SEEN = set()

    while True:
        if ID>lastResult:
            print ( 
========================== 
                    
ID==%s 
                    
   ID>lastResult is %s : program STOPS )
                  % (ID,ID>lastResult,)
            break
        runs += 1
        if runs % 10 == 0:  time。sleep(0。5)
        if ID in SEEN:
            print  -----------------
ID== +str(ID)+   already seen, not examined 
            ID += 1
        else:
            curRes = isValid(ID)
            if curRes:
                print  ------------------------------------
ID== +str(ID)+   vaaaalid 
                while True:
                    for i in diff:
                        runs += 1
                        if runs % 10 == 0:  time。sleep(0。5)
                        curRes = isValid(ID+i)
                        SEEN。add(ID+i)
                        if curRes:
                            print     i==%2s   ID+i==%s   valid  % (i,ID+i)
                            ID += i
                            print  --------------------------
ID==%s  % str(ID)
                            break
                        else:
                            print     i==%2s   ID+i==%s   not valid  % (i,ID+i)
                    else:
                        for j in others:
                            if ID+j>lastResult:
                                print  
   j==%2s   %s+%s==%s>lastResult==%s is %s  
                                      % (j,ID,j,ID+j,lastResult,ID+j>lastResult)
                                ID += j
                                print  
--------------------------
now ID== ,ID
                                break
                            runs += 1
                            if runs % 10 == 0:  time。sleep(0。5)
                            curRes = isValid(ID+j)
                            SEEN。add(ID+j)
                            if curRes:
                                print     j==%2s   ID+j==%s   valid  % (j,ID+j)
                                ID += j
                                print  --------------------------
ID== +str(ID)
                                break
                            else:
                                print     j==%2s   ID+j==%s   not valid  % (j,ID+j)

                        if j==lastothers:
                            ID += maxdiff + 1
                            print     ID += %s + 1 ==> ID==%s  % (maxdiff,ID)
                            break
                        elif ID>lastResult:
                            print     ID>lastResult==%s>%s is %s ==> WILL STOP  % (ID,lastResult,ID>lastResult)
                            break

            else:
                print  -------------------------
ID== +str(ID)+   not valid 
                ID += 1




def isValid(ID, valid_ones = (1,9,17,25,50,52,60,83,97,98)):
    return ID in valid_ones


checkNextID(0)

结果

-------------------------
ID==0  not valid
------------------------------------
ID==1  vaaaalid
   i== 8   ID+i==9   valid
--------------------------
ID==9
   i== 8   ID+i==17   valid
--------------------------
ID==17
   i== 8   ID+i==25   valid
--------------------------
ID==25
   i== 8   ID+i==33   not valid
   i==18   ID+i==43   not valid
   i== 7   ID+i==32   not valid
   i==17   ID+i==42   not valid
   i== 6   ID+i==31   not valid
   i==16   ID+i==41   not valid
   i== 5   ID+i==30   not valid
   i==15   ID+i==40   not valid
   j== 1   ID+j==26   not valid
   j== 2   ID+j==27   not valid
   j== 3   ID+j==28   not valid
   j== 4   ID+j==29   not valid
   j== 9   ID+j==34   not valid
   j==10   ID+j==35   not valid
   j==11   ID+j==36   not valid
   j==12   ID+j==37   not valid
   j==13   ID+j==38   not valid
   j==14   ID+j==39   not valid
   ID += 18 + 1 ==> ID==44
-------------------------
ID==44  not valid
-------------------------
ID==45  not valid
-------------------------
ID==46  not valid
-------------------------
ID==47  not valid
-------------------------
ID==48  not valid
-------------------------
ID==49  not valid
------------------------------------
ID==50  vaaaalid
   i== 8   ID+i==58   not valid
   i==18   ID+i==68   not valid
   i== 7   ID+i==57   not valid
   i==17   ID+i==67   not valid
   i== 6   ID+i==56   not valid
   i==16   ID+i==66   not valid
   i== 5   ID+i==55   not valid
   i==15   ID+i==65   not valid
   j== 1   ID+j==51   not valid
   j== 2   ID+j==52   valid
--------------------------
ID==52
   i== 8   ID+i==60   valid
--------------------------
ID==60
   i== 8   ID+i==68   not valid
   i==18   ID+i==78   not valid
   i== 7   ID+i==67   not valid
   i==17   ID+i==77   not valid
   i== 6   ID+i==66   not valid
   i==16   ID+i==76   not valid
   i== 5   ID+i==65   not valid
   i==15   ID+i==75   not valid
   j== 1   ID+j==61   not valid
   j== 2   ID+j==62   not valid
   j== 3   ID+j==63   not valid
   j== 4   ID+j==64   not valid
   j== 9   ID+j==69   not valid
   j==10   ID+j==70   not valid
   j==11   ID+j==71   not valid
   j==12   ID+j==72   not valid
   j==13   ID+j==73   not valid
   j==14   ID+j==74   not valid
   ID += 18 + 1 ==> ID==79
-------------------------
ID==79  not valid
-------------------------
ID==80  not valid
-------------------------
ID==81  not valid
-------------------------
ID==82  not valid
------------------------------------
ID==83  vaaaalid
   i== 8   ID+i==91   not valid
   i==18   ID+i==101   not valid
   i== 7   ID+i==90   not valid
   i==17   ID+i==100   not valid
   i== 6   ID+i==89   not valid
   i==16   ID+i==99   not valid
   i== 5   ID+i==88   not valid
   i==15   ID+i==98   valid
--------------------------
ID==98
   i== 8   ID+i==106   not valid
   i==18   ID+i==116   not valid
   i== 7   ID+i==105   not valid
   i==17   ID+i==115   not valid
   i== 6   ID+i==104   not valid
   i==16   ID+i==114   not valid
   i== 5   ID+i==103   not valid
   i==15   ID+i==113   not valid
   j== 1   ID+j==99   not valid
   j== 2   ID+j==100   not valid

   j== 3   98+3==101>lastResult==100 is True

--------------------------
now ID== 101
   ID>lastResult==101>100 is True ==> WILL STOP

==========================
ID==101
   ID>lastResult is True : program STOPS

这里有:

    if ID in SEEN:
        print  -----------------
ID== +str(ID)+   already seen, not examined 
        ID += 1

in this code, but the message already seen is never printed during the execution; however, the detection of valids ID has the same 结果; that means that the use of a set SEEN isn t necessary in my code。

Edit 1

第1号法律指示经常空出的SEEN

import time

lastResult = 100

def checkNextID(ID, lastResult = lastResult, diff = [8,18,7,17,6,16,5,15]):
    runs = 0
    SEEN = set()
    while True:
        if ID>lastResult:
            print ( 
========================== 
                    
ID==%s 
                    
   ID>lastResult is %s : program STOPS )
                  % (ID,ID>lastResult,)
            break
        runs += 1
        if runs % 10 == 0:  time。sleep(0。5)
        if ID in SEEN:
            print  -----------------
%s
ID==%s  already seen, not examined  % (SEEN,ID)
            ID += 1
        else:
            curRes = isValid(ID)
            if curRes:
                print  --------------------------
%s
ID==%s  vaaaalid   % (SEEN,ID)
                while True:
                    for i in diff:
                        runs += 1
                        if runs % 10 == 0:  time。sleep(0。5)
                        curRes = isValid(ID+i)
                        print      +str(SEEN)
                        if i==diff[0]:
                            SEEN = set([ID+i])
                        else:
                            SEEN。add(ID+i)
                        if curRes:
                            print     i==%2s   ID+i==%s   valid  % (i,ID+i)
                            ID += i
                            print  --------------------------
ID==%s  % str(ID)
                            break
                        else:
                            print     i==%2s   ID+i==%s   not valid  % (i,ID+i)
                    else:
                        ID += 1
                        break
            else:
                print  --------------------------
%s
ID==%s  not vaaaaalid  % (SEEN,ID)
                ID += 1


def isValid(ID, valid_ones = (1,9,17,25,30,50,52,60,83,97,98)):
    return ID in valid_ones


checkNextID(0)

结果

--------------------------
set([])
ID==0  not vaaaaalid
--------------------------
set([])
ID==1  vaaaalid
   set([])
   i== 8   ID+i==9   valid
--------------------------
ID==9
   set([9])
   i== 8   ID+i==17   valid
--------------------------
ID==17
   set([17])
   i== 8   ID+i==25   valid
--------------------------
ID==25
   set([25])
   i== 8   ID+i==33   not valid
   set([33])
   i==18   ID+i==43   not valid
   set([33, 43])
   i== 7   ID+i==32   not valid
   set([32, 33, 43])
   i==17   ID+i==42   not valid
   set([32, 33, 42, 43])
   i== 6   ID+i==31   not valid
   set([32, 33, 42, 43, 31])
   i==16   ID+i==41   not valid
   set([32, 33, 41, 42, 43, 31])
   i== 5   ID+i==30   valid
--------------------------
ID==30
   set([32, 33, 41, 42, 43, 30, 31])
   i== 8   ID+i==38   not valid
   set([38])
   i==18   ID+i==48   not valid
   set([48, 38])
   i== 7   ID+i==37   not valid
   set([48, 37, 38])
   i==17   ID+i==47   not valid
   set([48, 37, 38, 47])
   i== 6   ID+i==36   not valid
   set([48, 36, 37, 38, 47])
   i==16   ID+i==46   not valid
   set([36, 37, 38, 46, 47, 48])
   i== 5   ID+i==35   not valid
   set([35, 36, 37, 38, 46, 47, 48])
   i==15   ID+i==45   not valid
--------------------------
set([35, 36, 37, 38, 45, 46, 47, 48])
ID==31  not vaaaaalid
--------------------------
set([35, 36, 37, 38, 45, 46, 47, 48])
ID==32  not vaaaaalid
--------------------------
set([35, 36, 37, 38, 45, 46, 47, 48])
ID==33  not vaaaaalid
--------------------------
set([35, 36, 37, 38, 45, 46, 47, 48])
ID==34  not vaaaaalid
-----------------
set([35, 36, 37, 38, 45, 46, 47, 48])
ID==35  already seen, not examined
-----------------
set([35, 36, 37, 38, 45, 46, 47, 48])
ID==36  already seen, not examined
-----------------
set([35, 36, 37, 38, 45, 46, 47, 48])
ID==37  already seen, not examined
-----------------
set([35, 36, 37, 38, 45, 46, 47, 48])
ID==38  already seen, not examined
--------------------------
set([35, 36, 37, 38, 45, 46, 47, 48])
ID==39  not vaaaaalid
--------------------------
set([35, 36, 37, 38, 45, 46, 47, 48])
ID==40  not vaaaaalid
--------------------------
set([35, 36, 37, 38, 45, 46, 47, 48])
ID==41  not vaaaaalid
--------------------------
set([35, 36, 37, 38, 45, 46, 47, 48])
ID==42  not vaaaaalid
--------------------------
set([35, 36, 37, 38, 45, 46, 47, 48])
ID==43  not vaaaaalid
--------------------------
set([35, 36, 37, 38, 45, 46, 47, 48])
ID==44  not vaaaaalid
-----------------
set([35, 36, 37, 38, 45, 46, 47, 48])
ID==45  already seen, not examined
-----------------
set([35, 36, 37, 38, 45, 46, 47, 48])
ID==46  already seen, not examined
-----------------
set([35, 36, 37, 38, 45, 46, 47, 48])
ID==47  already seen, not examined
-----------------
set([35, 36, 37, 38, 45, 46, 47, 48])
ID==48  already seen, not examined
--------------------------
set([35, 36, 37, 38, 45, 46, 47, 48])
ID==49  not vaaaaalid
--------------------------
set([35, 36, 37, 38, 45, 46, 47, 48])
ID==50  vaaaalid
   set([35, 36, 37, 38, 45, 46, 47, 48])
   i== 8   ID+i==58   not valid
   set([58])
   i==18   ID+i==68   not valid
   set([58, 68])
   i== 7   ID+i==57   not valid
   set([57, 58, 68])
   i==17   ID+i==67   not valid
   set([57, 58, 67, 68])
   i== 6   ID+i==56   not valid
   set([56, 57, 58, 67, 68])
   i==16   ID+i==66   not valid
   set([66, 67, 68, 56, 57, 58])
   i== 5   ID+i==55   not valid
   set([66, 67, 68, 55, 56, 57, 58])
   i==15   ID+i==65   not valid
--------------------------
set([65, 66, 67, 68, 55, 56, 57, 58])
ID==51  not vaaaaalid
--------------------------
set([65, 66, 67, 68, 55, 56, 57, 58])
ID==52  vaaaalid
   set([65, 66, 67, 68, 55, 56, 57, 58])
   i== 8   ID+i==60   valid
--------------------------
ID==60
   set([60])
   i== 8   ID+i==68   not valid
   set([68])
   i==18   ID+i==78   not valid
   set([68, 78])
   i== 7   ID+i==67   not valid
   set([67, 68, 78])
   i==17   ID+i==77   not valid
   set([67, 68, 77, 78])
   i== 6   ID+i==66   not valid
   set([66, 67, 68, 77, 78])
   i==16   ID+i==76   not valid
   set([66, 67, 68, 76, 77, 78])
   i== 5   ID+i==65   not valid
   set([65, 66, 67, 68, 76, 77, 78])
   i==15   ID+i==75   not valid
--------------------------
set([65, 66, 67, 68, 75, 76, 77, 78])
ID==61  not vaaaaalid
--------------------------
set([65, 66, 67, 68, 75, 76, 77, 78])
ID==62  not vaaaaalid
--------------------------
set([65, 66, 67, 68, 75, 76, 77, 78])
ID==63  not vaaaaalid
--------------------------
set([65, 66, 67, 68, 75, 76, 77, 78])
ID==64  not vaaaaalid
-----------------
set([65, 66, 67, 68, 75, 76, 77, 78])
ID==65  already seen, not examined
-----------------
set([65, 66, 67, 68, 75, 76, 77, 78])
ID==66  already seen, not examined
-----------------
set([65, 66, 67, 68, 75, 76, 77, 78])
ID==67  already seen, not examined
-----------------
set([65, 66, 67, 68, 75, 76, 77, 78])
ID==68  already seen, not examined
--------------------------
set([65, 66, 67, 68, 75, 76, 77, 78])
ID==69  not vaaaaalid
--------------------------
set([65, 66, 67, 68, 75, 76, 77, 78])
ID==70  not vaaaaalid
--------------------------
set([65, 66, 67, 68, 75, 76, 77, 78])
ID==71  not vaaaaalid
--------------------------
set([65, 66, 67, 68, 75, 76, 77, 78])
ID==72  not vaaaaalid
--------------------------
set([65, 66, 67, 68, 75, 76, 77, 78])
ID==73  not vaaaaalid
--------------------------
set([65, 66, 67, 68, 75, 76, 77, 78])
ID==74  not vaaaaalid
-----------------
set([65, 66, 67, 68, 75, 76, 77, 78])
ID==75  already seen, not examined
-----------------
set([65, 66, 67, 68, 75, 76, 77, 78])
ID==76  already seen, not examined
-----------------
set([65, 66, 67, 68, 75, 76, 77, 78])
ID==77  already seen, not examined
-----------------
set([65, 66, 67, 68, 75, 76, 77, 78])
ID==78  already seen, not examined
--------------------------
set([65, 66, 67, 68, 75, 76, 77, 78])
ID==79  not vaaaaalid
--------------------------
set([65, 66, 67, 68, 75, 76, 77, 78])
ID==80  not vaaaaalid
--------------------------
set([65, 66, 67, 68, 75, 76, 77, 78])
ID==81  not vaaaaalid
--------------------------
set([65, 66, 67, 68, 75, 76, 77, 78])
ID==82  not vaaaaalid
--------------------------
set([65, 66, 67, 68, 75, 76, 77, 78])
ID==83  vaaaalid
   set([65, 66, 67, 68, 75, 76, 77, 78])
   i== 8   ID+i==91   not valid
   set([91])
   i==18   ID+i==101   not valid
   set([91, 101])
   i== 7   ID+i==90   not valid
   set([90, 91, 101])
   i==17   ID+i==100   not valid
   set([90, 91, 100, 101])
   i== 6   ID+i==89   not valid
   set([89, 90, 91, 100, 101])
   i==16   ID+i==99   not valid
   set([99, 100, 101, 89, 90, 91])
   i== 5   ID+i==88   not valid
   set([99, 100, 101, 88, 89, 90, 91])
   i==15   ID+i==98   valid
--------------------------
ID==98
   set([98, 99, 100, 101, 88, 89, 90, 91])
   i== 8   ID+i==106   not valid
   set([106])
   i==18   ID+i==116   not valid
   set([106, 116])
   i== 7   ID+i==105   not valid
   set([105, 106, 116])
   i==17   ID+i==115   not valid
   set([105, 106, 115, 116])
   i== 6   ID+i==104   not valid
   set([104, 105, 106, 115, 116])
   i==16   ID+i==114   not valid
   set([104, 105, 106, 114, 115, 116])
   i== 5   ID+i==103   not valid
   set([103, 104, 105, 106, 114, 115, 116])
   i==15   ID+i==113   not valid
--------------------------
set([103, 104, 105, 106, 113, 114, 115, 116])
ID==99  not vaaaaalid
--------------------------
set([103, 104, 105, 106, 113, 114, 115, 116])
ID==100  not vaaaaalid

==========================
ID==101
   ID>lastResult is True : program STOPS

This above code shows the process concerning the recording and emptying the values of ID already seen。 It is a code as good as it is conceivable, since the algorithm comprises the periodical emptying of SEEN, because it s true that emptying is necessary, given the number of IDs to te be tested。

But from the beginning, my opinion was that this process concernin SEEN in this algorithm weighs on the performance, because recording and testing instructions concerning a SEEN are repeatedly executed at every step of the program。

That s why I imagined there should be another algorithm having not this drawback。 I finally succeeded to write such an alternative code, and now we have two codes having two different algorithms。

Concerning your question "are you sure it s not necessary to use the SEEN logic in the second?"
I answer yes, I think I can be sure 。 The intent of runing my code #2 with instructions managing SEEN was to make me sure after verifying what was a mental idea and a conceptual algorithm。 If you want to become sure, you must do the same: - study conceptually and precisely the algorithm - do as many essays of execution of the two codes and comparison of their 结果s as you need to be experimentally convinced, varying the values of lastResult, valid_ones and diff For me , this point is closed as long as there won t be a contradictory practical case proving that my conclusion is wrong。

I continue in the other answer, because the number of characters is limited in this one

问题回答

如果想要在作出修改的职能之外引起对其的修改,则将识别符号定为global

Hence it is an aberration to make lastResult and curRes global:

  • the first, lastResult, because it is a constant in your complete code。 The best is to define a parameter lastResult of the function checkNextID() with the value of lastResult as default argument。

  • 第二,curRes,因为在checkNextID()中,该标识没有改动。

现在,将/curRes定义为global,载于isValid()这一功能中,也是一种不好的做法,但并非是一种错误做法: 1 curRes的新价值从isValid(<>/strong>)内发送至外部;2)之后,该方案超出了<>strong>checkNextID()<查询/curRes的价值。 有了这种无休止和无用的绕行,你可以请<>免费<><>>>(见checkNextID()的职能中“release,而且这一职能将自动超出此名称的解决和获得其价值。

Personally , I prefer to reorganise the general algorithm。 In my following code, curRes is defined as a local object, taking directly its value from the return of the function isValid() 。 That requires to redefine isValid(): in my code isValid() returns the object soup or False

I hope I understood your need。 Say me what s wrong in my approach, please。

def checkNextID(ID, lastResult = lastResult, diff = [0,1,5,6,7,8,15,16,17,18]):
    runs = 0
    maxdiff = max(diff)
    diff。extend(x for x in xrange(maxdiff) if x not in diff)
    while True:
        for i in diff:
            if ID+i==lastResult:  break
            runs += 1
            if runs % 10 == 0:  time。sleep(6)
            curRes = isValid(ID+i):
            if cuRes:
                parseHTML(curRes, ID+i)
                ID = ID + i
                break
        else:
            runs += 1
            ID += maxdiff + 1
            if ID==lastResult:  break






def isValid(ID, urlhead = urlPath):
    # this function return either False OR a BeautifulSoup instance
    try:
        page = getPAGE(urlhead + str(ID))
        if page == False:  return False
    except Exception, e:
        print "An error occured in the first Exception block of parseHTML : " + str(e) +  address:   + address
    else:
        try:
            soup = BeautifulSoup(page)
        except TypeError, e:
            print "An error occured in the second Exception block of parseHTML : " + str(e) +  address:   + address
            return False
        try:
            companyID = soup。find( span ,id= lblCompanyNumber )。string
            if (companyID == None): #if lblCompanyNumber is None we can assume that we don t have the content we want, save in the bad log file
                saveToCsv(ID, isEmpty = True)
                return False
            else:
                return soup #we have the data we need, save the soup obj to a global variable
        except Exception, e:
            print "Error while parsing this page, third exception block: " + str(e) +   id:   + address
            return False

此外,为了加快方案:

  • 您应使用regex tool (module re),而不是使用比使用reg慢10倍的美容。

  • 页: 1 每项职能要求都比直接守则需要额外时间

Final Edit

To conclude this long study of your problem, I did a benchmark。 Here after follow the codes and the results that show that my intuition is borne out: my code #2 takes at least 20 % less time to run than your code #1 。 Your code #1:

from time import clock

lastResult = 200

def checkNextID(ID, lastResult = lastResult, diff = [8,18,7,17,6,16,5,15]):
    SEEN = set()
    li = []

    while True:
        if ID>lastResult:
            break
        if ID in SEEN:
            ID += 1
        else:
            curRes = isValid(ID)
            if curRes:
                li。append(ID)
                while True:
                    for i in diff:
                        curRes = isValid(ID+i)
                        if i==diff[0]:
                            SEEN = set([ID+i])
                        else:
                            SEEN。add(ID+i)
                        if curRes:
                            li。append(ID+i)
                            ID += i
                            break
                    else:
                        ID += 1
                        break
            else:
                ID += 1

    return li


def isValid(ID, valid_ones = (1,9,17,25,30,50,52,60,83,97,98,114,129,137,154,166,175,180,184,200)):
    return ID in valid_ones

te = clock()
for i in xrange(10000):
    checkNextID(0)
print clock()-te, seconds 
print checkNextID(0)

我的法典2

from time import clock

lastResult = 200

def checkNextID(ID, lastResult = lastResult, diff = [8,18,7,17,6,16,5,15]):
    maxdiff = max(diff)
    others = [x for x in xrange(1,maxdiff) if x not in diff]
    lastothers = others[-1]

    li = []

    while True:
        if ID>lastResult:
            break
        else:
            curRes = isValid(ID)
            if curRes:
                li。append(ID)
                while True:
                    for i in diff:
                        curRes = isValid(ID+i)
                        if curRes:
                            li。append(ID+i)
                            ID += i
                            break
                    else:
                        for j in others:
                            if ID+j>lastResult:
                                ID += j
                                break
                            curRes = isValid(ID+j)
                            if curRes:
                                li。append(ID+j)
                                ID += j
                                break

                        if j==lastothers:
                            ID += maxdiff + 1
                            break
                        elif ID>lastResult:
                            break

            else:
                ID += 1

    return li




def isValid(ID, valid_ones = (1,9,17,25,30,50,52,60,83,97,98,114,129,137,154,166,175,180,184,200)):
    return ID in valid_ones

te = clock()
for i in xrange(10000):
    checkNextID(0)
print clock()-te, seconds 
print checkNextID(0)

成果:

your code
0。398804596674 seconds
[1, 9, 17, 25, 30, 50, 52, 60, 83, 98, 114, 129, 137, 154, 166, 184, 200]

my code
0。268061164198 seconds
[1, 9, 17, 25, 30, 50, 52, 60, 83, 98, 114, 129, 137, 154, 166, 184, 200]

0。268061164198 / 0。398804596674 = 67。3 %

I ve tried also with lastResult = 100 , I got 72 % 。
And with lastResult = 480, I got 80 %。

首先,你应将工作分成两个进程。 一种用于确定有效胎体,另一种用于检索数据。

确定有效胎盘的只需要使用http:// HEAD指令,并且能够比检索页面更快地工作。

在检查网页时,在你核对<编码>diff上的贴现后,在导致你开始使用<编码>diff的封面上添加18。 甚至可以记录只通过<编码>diff进行部分检查的幅度,然后在程序结束时再回来,并检查所有范围。

如果你能够跳出任何ids子,那么就把最后一批核子藏起来,然后检查一下哪里,那儿的 n子等于透镜。 a. 采用环绕缓冲手段:

nextelem = 0
......
# check before retrieving
if not id in ringbuff:
    #retrieve an id
    ringbuf[nextelem] = id
    nextelem += 1
    if nextelem > len[ringbuff]:
        nextelem = 0

......

On the surface of it, a simple loop like this should check all ids:

for id in xrange(1000000):
    checkpage(id)

This would check every possible page. But you want to read ahead when you get a hit, and also backtrack partially if I understand correctly. In any case, what you are fundamentally doing is changing the range of ids from the simple sequence returned by xrange, so I think you need to write a generator and do this instead:

for id in myrange(1000000):
    checkpage(id)

You might still want to use a ringbuffer, depending on what you do within that range of 18 possible additional hits. If you need to check all of the possibilities in diff and then go back to something less than the maximum element in diff, then the ring buffer would be useful in checkpage.

但骗局是写我的奇怪。

def myrange(maxnum):
    global hitfound
    global nextnum
    global diff
    curnum = 0
    while curnum < maxnum:
        yield curnum
        if hitfound:
            nextnum = curnum
            hitnum = curnum
            for e in diff:
                yield hitnum + e
            curnum = nextnum - 1
        curnum += 1

三个全球变量使你们能够影响婴儿的范围。 如果你在<条码>内设置<条码>,>真实,只要你有良好的页码,你就会影响我开始在<条码>diff上适用加插的日程。 然后,你可以制定<条码>nextnum,以便在你开始应用电梯加固后开始加固。 例如,你可能决定将其定为比检查iff升时发现的首例(或最后一次)高1次。 或者,你可以单独离开,利用环缓冲,确保你不再要求任何宽松的页数。

我建议,你照我前面的法典,单独测试宽减逻辑。 Tweak the generator myrange(>) until it production the right series, and subsequently pop it into their web wasteing program.





相关问题
How to add/merge several Big O s into one

If I have an algorithm which is comprised of (let s say) three sub-algorithms, all with different O() characteristics, e.g.: algorithm A: O(n) algorithm B: O(log(n)) algorithm C: O(n log(n)) How do ...

Grokking Timsort

There s a (relatively) new sort on the block called Timsort. It s been used as Python s list.sort, and is now going to be the new Array.sort in Java 7. There s some documentation and a tiny Wikipedia ...

Manually implementing high performance algorithms in .NET

As a learning experience I recently tried implementing Quicksort with 3 way partitioning in C#. Apart from needing to add an extra range check on the left/right variables before the recursive call, ...

Print possible strings created from a Number

Given a 10 digit Telephone Number, we have to print all possible strings created from that. The mapping of the numbers is the one as exactly on a phone s keypad. i.e. for 1,0-> No Letter for 2->...

Enumerating All Minimal Directed Cycles Of A Directed Graph

I have a directed graph and my problem is to enumerate all the minimal (cycles that cannot be constructed as the union of other cycles) directed cycles of this graph. This is different from what the ...

Quick padding of a string in Delphi

I was trying to speed up a certain routine in an application, and my profiler, AQTime, identified one method in particular as a bottleneck. The method has been with us for years, and is part of a "...

热门标签