An obvious, simple solution is to run strip_tags() over it, but that only removes the tags and leaves all text content intact, including embedded JavaScript and CSS, as well as any text inside elements that are normally hidden (e.g. by setting display: none on them). You could try some regex magic to filter out the parts you're not interested in, but regular expressions on HTML are generally a bad idea for anything nontrivial. The ultimate solution is, I'm afraid, to use a proper HTML parser and then pull the actual text out of the resulting DOM tree - by the time you have that, you'll be pretty close to implementing a web browser.
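As a rough illustration of the parser-based approach, here is a minimal sketch using PHP's bundled DOM extension. The function name visibleText() and the sample markup are made up for this example, and the sketch only catches inline style attributes and the hidden attribute; anything hidden through a stylesheet rule or by script is still returned, which is exactly where the "implementing a web browser" problem begins. It also assumes the input is ASCII or declares its own charset, since loadHTML() otherwise falls back to ISO-8859-1.

```php
<?php
// Hypothetical helper for illustration; not part of PHP or of the original answer.
function visibleText(string $html): string
{
    if (trim($html) === '') {
        return '';
    }

    $doc = new DOMDocument();
    // Real-world markup is often malformed; collect libxml warnings instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Elements whose text is never rendered, plus elements hidden inline.
    // Assumption: only inline "display:none" and the "hidden" attribute are checked.
    $query = '//head | //script | //style | //noscript | //template'
           . ' | //*[@hidden]'
           . " | //*[contains(translate(@style, ' ', ''), 'display:none')]";

    // Copy the node list to an array first, because removing nodes while
    // iterating a live DOMNodeList can skip entries.
    foreach (iterator_to_array($xpath->query($query)) as $node) {
        if ($node->parentNode !== null) {
            $node->parentNode->removeChild($node);
        }
    }

    if ($doc->documentElement === null) {
        return '';
    }

    // Collapse runs of whitespace left behind by the markup.
    return trim(preg_replace('/\s+/', ' ', $doc->documentElement->textContent));
}

$sample = '<html><head><title>T</title><style>p{color:red}</style></head>'
        . '<body><p>Hello <b>world</b></p><script>alert(1)</script>'
        . '<div style="display: none">secret</div></body></html>';

echo visibleText($sample), "\n"; // prints "Hello world"
echo strip_tags($sample), "\n";  // also includes the title, CSS, script and hidden text
```

Comparing the two echo lines on your own pages is a quick way to see how much unwanted text strip_tags() lets through.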
From:
http://stackoverflow.com/questions/5572469/get-html-output-cleaned-text-with-php