Question

I need to compare some theoretical data with real data in python. The theoretical data comes from resolving an equation. To improve the comparative I would like to remove data points that fall far from the theoretical curve. I mean, I want to remove the points below and above red dashed lines in the figure (made with matplotlib). Data points and theoretical curves

理论曲线和数据点都是不同长度的数组。

我可以尝试以粗略的方式去除这些点，例如：可以使用以下方法检测第一个上点：

data2[(data2.redshift<0.4)&data2.dmodulus>1]
rec.array([( 1997o , 0.374, 1.0203223485103787, 0.44354759972859786)], dtype=[( SN_name ,  |S10 ), ( redshift ,  <f8 ), ( dmodulus ,  <f8 ), ( dmodulus_error ,  <f8 )])

但我想用一种不那么粗暴的眼神。

那么，有人能帮我找到一种简单的方法来消除问题点吗？

非常感谢。

Answer 1

这可能有些过头了，而且是基于你的评论

Both the theoretical curves and the data points are arrays of different length.

我会做以下事情：

Truncate the data set so that its x values lie within the max and min values of the theoretical set.
Interpolate the theoretical curve using scipy.interpolate.interp1d and the above truncated data x values. The reason for step (1) is to satisfy the constraints of interp1d.
Use numpy.where to find data y values that are out side the range of acceptable theory values.
DONT discard these values, as was suggested in comments and other answers. If you want for clarity, point them out by plotting the inliners one color and the outliers an other color.

我想这是一个接近你想要的剧本。希望它能帮助你完成你想要的：

import numpy as np
import scipy.interpolate as interpolate
import matplotlib.pyplot as plt

# make up data
def makeUpData():
       Make many more data points (x,y,yerr) than theory (x,y),
    with theory yerr corresponding to a constant "sigma" in y, 
    about x,y value   
    NX= 150
    dataX = (np.random.rand(NX)*1.1)**2
    dataY = (1.5*dataX+np.random.rand(NX)**2)*dataX
    dataErr = np.random.rand(NX)*dataX*1.3
    theoryX = np.arange(0,1,0.1)
    theoryY = theoryX*theoryX*1.5
    theoryErr = 0.5
    return dataX,dataY,dataErr,theoryX,theoryY,theoryErr

def makeSameXrange(theoryX,dataX,dataY):
       
    Truncate the dataX and dataY ranges so that dataX min and max are with in
    the max and min of theoryX.
       
    minT,maxT = theoryX.min(),theoryX.max()
    goodIdxMax = np.where(dataX<maxT)
    goodIdxMin = np.where(dataX[goodIdxMax]>minT)
    return (dataX[goodIdxMax])[goodIdxMin],(dataY[goodIdxMax])[goodIdxMin]

# take  theory  and get values at every  data  x point
def theoryYatDataX(theoryX,theoryY,dataX):
       For every dataX point, find interpolated thoeryY value. theoryx needed
    for interpolation.   
    f = interpolate.interp1d(theoryX,theoryY)
    return f(dataX[np.where(dataX<np.max(theoryX))])

# collect valid points
def findInlierSet(dataX,dataY,interpTheoryY,thoeryErr):
       Find where theoryY-theoryErr < dataY theoryY+theoryErr and return
    valid indicies.   
    withinUpper = np.where(dataY<(interpTheoryY+theoryErr))
    withinLower = np.where(dataY[withinUpper]
                    >(interpTheoryY[withinUpper]-theoryErr))
    return (dataX[withinUpper])[withinLower],(dataY[withinUpper])[withinLower]

def findOutlierSet(dataX,dataY,interpTheoryY,thoeryErr):
       Find where theoryY-theoryErr < dataY theoryY+theoryErr and return
    valid indicies.   
    withinUpper = np.where(dataY>(interpTheoryY+theoryErr))
    withinLower = np.where(dataY<(interpTheoryY-theoryErr))
    return (dataX[withinUpper],dataY[withinUpper],
            dataX[withinLower],dataY[withinLower])
if __name__ == "__main__":

    dataX,dataY,dataErr,theoryX,theoryY,theoryErr = makeUpData()

    TruncDataX,TruncDataY = makeSameXrange(theoryX,dataX,dataY)

    interpTheoryY = theoryYatDataX(theoryX,theoryY,TruncDataX)

    inDataX,inDataY = findInlierSet(TruncDataX,TruncDataY,interpTheoryY,
                                    theoryErr)

    outUpX,outUpY,outDownX,outDownY = findOutlierSet(TruncDataX,
                                                    TruncDataY,
                                                    interpTheoryY,
                                                    theoryErr)
    #print inlierIndex
    fig = plt.figure()
    ax = fig.add_subplot(211)

    ax.errorbar(dataX,dataY,dataErr,fmt= . ,color= k )
    ax.plot(theoryX,theoryY, r- )
    ax.plot(theoryX,theoryY+theoryErr, r-- )
    ax.plot(theoryX,theoryY-theoryErr, r-- )
    ax.set_xlim(0,1.4)
    ax.set_ylim(-.5,3)
    ax = fig.add_subplot(212)

    ax.plot(inDataX,inDataY, ko )
    ax.plot(outUpX,outUpY, bo )
    ax.plot(outDownX,outDownY, ro )
    ax.plot(theoryX,theoryY, r- )
    ax.plot(theoryX,theoryY+theoryErr, r-- )
    ax.plot(theoryX,theoryY-theoryErr, r-- )
    ax.set_xlim(0,1.4)
    ax.set_ylim(-.5,3)
    fig.savefig( findInliers.png )

This figure is the result: enter image description here

Answer 2

最后，我使用了一些Yann代码：

def theoryYatDataX(theoryX,theoryY,dataX):
   For every dataX point, find interpolated theoryY value. theoryx needed
for interpolation.   
f = interpolate.interp1d(theoryX,theoryY)
return f(dataX[np.where(dataX<np.max(theoryX))])

def findOutlierSet(data,interpTheoryY,theoryErr):
       Find where theoryY-theoryErr < dataY theoryY+theoryErr and return
    valid indicies.   

    up = np.where(data.dmodulus > (interpTheoryY+theoryErr))
    low = np.where(data.dmodulus < (interpTheoryY-theoryErr))
    # join all the index together in a flat array
    out = np.hstack([up,low]).ravel()

    index = np.array(np.ones(len(data),dtype=bool))
    index[out]=False

    datain = data[index]
    dataout = data[out]

    return datain, dataout

def selectdata(data,theoryX,theoryY):
    """
    Data selection: z<1 and +-0.5 LFLRW separation
    """
    # Select data with redshift z<1
    data1 = data[data.redshift < 1]

    # From modulus to light distance:
    data1.dmodulus, data1.dmodulus_error = modulus2distance(data1.dmodulus,data1.dmodulus_error)

    # redshift data order
    data1.sort(order= redshift )

    # Outliers: distance to LFLRW curve bigger than +-0.5
    theoryErr = 0.5
    # Theory curve Interpolation to get the same points as data
    interpy = theoryYatDataX(theoryX,theoryY,data1.redshift)

    datain, dataout = findOutlierSet(data1,interpy,theoryErr)
    return datain, dataout

使用这些函数，我最终可以获得：

谢谢大家的帮助。

Answer 3

只要看看红色曲线和点之间的差异，如果它大于红色曲线和红色虚线之间的差异就可以将其移除。

diff=np.abs(points-red_curve)
index= (diff>(dashed_curve-redcurve))
filtered=points[index]

但请认真对待尼克LH的评论。你的数据在没有任何过滤的情况下看起来很好，你的“outliers”都有很大的错误，不会对拟合产生太大影响。

Answer 4

您可以使用numpy.where（）来确定哪些xy对符合绘图标准，也可以枚举来做几乎相同的事情。示例：

x_list = [ 1,  2,  3,  4,  5,  6 ]
y_list = [ f , o , o , b , a , r ]

result = [y_list[i] for i, x in enumerate(x_list) if 2 <= x < 5]

print result

我相信你可以改变条件，使上面例子中的2和5是曲线的函数

友情链接