Finding and removing orphaned web pages, images, and other related files

I am working on a number of websites with files dating back to 2000. These sites have grown organically over time, resulting in large numbers of orphaned web pages, include files, images, CSS files, JavaScript files, etc. These orphaned files cause a number of problems including poor maintainability, possible security holes, poor customer experience, and driving OCD/GTD freaks like myself crazy.

These files number in the thousands so a completely manual solution is not feasible. Ultimately, the cleanup process will require a fairly large QA effort in order to ensure we have not inadvertently deleted needed files but I am hoping to develop a technological solution to help speed the manual effort. Additionally, I hope to put processes/utilities in place to help prevent this state of disorganization from happening in the future.

Environment Considerations:

  • Classic ASP and .Net
  • Windows servers running IIS 6 and IIS 7
  • Multiple environments (Dev, Integration, QA, Stage, Production)
  • TFS for source control

Before I start I would like to get some feedback from others who have successfully navigated a similar process.

Specifically I am looking for:

  • Process for identifying and cleaning up orphaned files
  • Process for keeping environments clean from orphaned files
  • Utilities that help identify orphaned files
  • Utilities that help identify broken links (once files have been removed)

I am not looking for:

  • Solutions to my organizational OCD...I like how I am.
  • Snide comments about us still using classic ASP. I already feel the pain. There is no need to rub it in.
Best answer

Step 1: Establish a list of pages on your site which are definitely visible. One intelligent way to create this list is to parse your log files for pages people visit.
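As a minimal sketch of the log-parsing idea, in Python: IIS writes W3C extended logs whose #Fields: directive names each column, so the script reads that directive rather than hard-coding a column index. The log path below is an assumption; point it at your own W3SVC directory.

    import glob

    def visited_urls(log_glob=r"C:\inetpub\logs\LogFiles\W3SVC1\*.log"):
        # Collect every URL that appears in the cs-uri-stem column.
        urls = set()
        for path in glob.glob(log_glob):
            uri_index = None
            with open(path, errors="replace") as f:
                for line in f:
                    if line.startswith("#Fields:"):
                        fields = line.split()[1:]  # drop the "#Fields:" token
                        uri_index = fields.index("cs-uri-stem") if "cs-uri-stem" in fields else None
                    elif not line.startswith("#") and uri_index is not None:
                        parts = line.split()
                        if len(parts) > uri_index:
                            urls.add(parts[uri_index].lower())
        return urls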

Step 2: Run a tool that recursively finds site topology, starting from a specially written page (that you will make on your site) which has a link to each page in step 1. One tool which can do this is Xenu's Link Sleuth. It's intended for finding dead links, but it will list live links as well. This can be run externally, so there are no security concerns with installing weird software onto your server. You'll need to watch over this occasionally, since your site may have effectively infinite pages if you have bugs or the like.
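Generating that seed page from the step 1 list takes only a few lines; a sketch (the base URL and output path are placeholders):

    def write_seed_page(urls, out_path="seed.html", base="http://dev.example.com"):
        # One anchor per known-visited URL; point Xenu at this page.
        with open(out_path, "w") as f:
            f.write("<html><body>\n")
            for u in sorted(urls):
                f.write('<a href="%s%s">%s</a><br>\n' % (base, u, u))
            f.write("</body></html>\n")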

Step 3: Run a tool that recursively maps your hard disk, starting from your site web directory. I can't think of any of these off the top of my head, but writing one should be trivial, and is safer since this will be run on your server.
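Writing one really is trivial in any scripting language; here is a sketch that emits site-relative paths in the same form the crawl produces (the web root path is an assumption):

    import os

    def files_on_disk(web_root=r"C:\inetpub\wwwroot"):
        # Walk the web root, normalizing each file to a /site/relative/path.
        found = set()
        for dirpath, _dirnames, filenames in os.walk(web_root):
            for name in filenames:
                rel = os.path.relpath(os.path.join(dirpath, name), web_root)
                found.add("/" + rel.replace(os.sep, "/").lower())
        return found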

Step 4: Take the results of steps 2 and 3 and programmatically match #2 against #3. Anything in #3 not in #2 is potentially an orphan page.
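With both lists normalized to the same path form, the match is just a set difference:

    def candidate_orphans(on_disk, crawled):
        # Files present on disk that the crawl never reached.
        return sorted(on_disk - crawled)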

Note: This technique works poorly with password-protected stuff, and also works poorly with sites relying heavily on dynamically generated links (dynamic content is fine if the links are consistent).

Other answers

At first I thought you could get away with scanning files for links, and then doing a diff against your folder structure - but this only identifies simple orphans, not collections of orphaned files that reference each other. So, using grep probably won't get you all the way there.

This isn't a trivial solution, but would make an excellent utility for keeping your environment clean (and therefore, worth the effort). Plus, you can re-use it across all environments (and share it with others!)

The basic idea is to set up and populate a directed graph where each node's key is an absolute path. This is done by scanning all the files and adding dependencies - for example:

/index.html     -> /subfolder/file.jpg
                -> /temp.html
                -> /error.html
/temp.html      -> /index.html
/error.html     
/stray.html     -> /index.html
/abandoned.html

Then, you can identify all your "reachable" files by doing a BFS on your root page.
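A minimal sketch of that idea in Python, assuming static links can be caught with a simple href/src regex; production code would also have to resolve relative paths and server-side include directives, which this glosses over by treating every link as root-relative:

    import collections, os, re

    LINK_RE = re.compile(r'(?:href|src)\s*=\s*["\']([^"\']+)["\']', re.I)
    SCANNED = (".html", ".htm", ".asp", ".aspx", ".css", ".js", ".inc")

    def build_graph(web_root):
        # Node key: site-relative path. Edge: a static reference in the file.
        graph = collections.defaultdict(set)
        for dirpath, _dirs, files in os.walk(web_root):
            for name in files:
                full = os.path.join(dirpath, name)
                node = "/" + os.path.relpath(full, web_root).replace(os.sep, "/").lower()
                graph[node]  # register every file, even ones with no out-links
                if os.path.splitext(name)[1].lower() in SCANNED:
                    try:
                        text = open(full, errors="replace").read()
                    except OSError:
                        continue
                    for target in LINK_RE.findall(text):
                        if not target.startswith(("http:", "https:", "mailto:", "#", "javascript:")):
                            graph[node].add("/" + target.split("?")[0].lstrip("/").lower())
        return graph

    def reachable(graph, root="/index.html"):
        # Plain BFS from the root page.
        seen, queue = {root}, collections.deque([root])
        while queue:
            for nxt in graph.get(queue.popleft(), ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen

Everything in set(graph) - reachable(graph, "/index.html") is then an orphan candidate.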

With the directed graph, you can also classify files by their in and out degree. In the example above:

/index.html     in: 1 out: 2
/temp.html      in: 1 out: 1
/error.html     in: 1 out: 0
/stray.html     in: 0 out: 1
/abandoned.html in: 0 out: 0

So, you're basically looking for files that have in = 0; those are the abandoned ones.

Additionally, files that have out = 0 are going to be terminal pages, which may or may not be desirable on your site (as the name suggests, /error.html is an error page).
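Computing the degrees from the graph built above is one pass over the edges:

    import collections

    def degrees(graph):
        # Returns {node: (in_degree, out_degree)}.
        in_deg = collections.Counter()
        for node, targets in graph.items():
            in_deg[node] += 0  # ensure unreferenced nodes still appear
            for t in targets:
                in_deg[t] += 1
        return {n: (in_deg[n], len(graph[n])) for n in graph}

    # abandoned = [n for n, (i, o) in degrees(g).items() if i == 0]
    # terminal  = [n for n, (i, o) in degrees(g).items() if o == 0]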

No snide comments here... I feel your pain as a large portion of our site is still in classic ASP.

I don't know of any fully automated systems that will be a magic bullet, but I did have a couple of ideas for what could help. At least that's how we cleaned up our site.

First, although it hardly seems like the tool for such a job, I've used Microsoft Visio to help with this. We have Visio for Enterprise Architects, and I am not sure if this feature is in other versions, but in this version, you can create a new document, and in the "choose drawing type" under the "Web Diagram" folder, there is an option for a "Web Site Map" (either Metric or US units - it doesn't matter).

When you create this drawing type, Visio prompts you for the URL of your web site, and then goes out and crawls your web site for you.

This should help to identify which files are valid. It's not perfect, but the way we used it was to find the files in the file system that did not show up in the Visio drawing, and then pull up the entire solution in Visual Studio and do a search for that file name. If we could not find it in the entire solution, we moved it off into an "Obsolete" folder for a month, and deleted it if we didn't start getting complaints or 404 errors on the web site.
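The "Obsolete" quarantine step is also easy to script so relative paths are preserved and a file can be put back if complaints come in; a sketch with placeholder paths:

    import os, shutil

    def quarantine(web_root, obsolete_root, suspects):
        # Move each suspect out of the web root, keeping its relative path.
        for rel in suspects:  # e.g. "/old/banner.gif"
            src = os.path.join(web_root, rel.lstrip("/"))
            dst = os.path.join(obsolete_root, rel.lstrip("/"))
            os.makedirs(os.path.dirname(dst), exist_ok=True)
            shutil.move(src, dst)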

Other possible solutions would be to use a log file parser and parse your logs for the last n months to see which files are being requested, but that would essentially be a lot of coding to come up with a list of "known good" files that's really no better than the Visio option.

Been there, done that many times. Why can't the content types clean up after themselves? Personally, I'd hit it something like this:

1) Get a copy of the site running in a QA environment.

2) Use Selenium (or some other browser-based testing tool) to create a suite of tests for stuff that works (see the sketch after this list).

3) Start deleting stuff that should be deleted.

4) Run tests from #2 after deleting stuff to ensure it still works.

5) Repeat #s 3 & 4 until satisfied.
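A minimal sketch of step 2, assuming the Python bindings for Selenium WebDriver (and a chromedriver on the PATH); the URLs and the title check are placeholders for real assertions about your pages:

    from selenium import webdriver

    def test_pages_still_work(urls):
        # Drive a real browser through each known-good page and fail loudly.
        driver = webdriver.Chrome()
        try:
            for url in urls:
                driver.get(url)
                assert "404" not in driver.title, "%s appears broken" % url
        finally:
            driver.quit()

    # test_pages_still_work(["http://qa.example.com/", "http://qa.example.com/about.asp"])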




