General Question

Is there any way to rip out all text from a webpage?

Asked by tlm (475 points) September 12th, 2011

I need to translate a webpage. I’d love to use Google Translator Toolkit for that, but if you just feed the webpage to it, the result is very ugly and almost unusable. I’d like to just rip out all the text somehow. Due to the page design, copy-pasting isn’t likely to be possible. Any ideas?

9 Answers

wonderingwhy's avatar

This may be of limited use, but you could try converting the page with readability.com, or disabling images or styles in the browser (the result may be ugly, but possibly more usable), and then translating the result.

LostInParadise's avatar

If you have Microsoft OneNote, you can copy a screen capture image from the page and paste it into a OneNote document. Right click the image. There will be an option for copying the text in the image.

If you do not have a screen capture program, this is a freeware program that I find very useful.

If you do not have OneNote, you may be able to get a trial version from Microsoft.

the100thmonkey's avatar

You don’t need a screen capture program; all you need is MS Paint.

Is the site coded in Flash or HTML? If it’s HTML, then turning JavaScript off should render all the ‘no right click’ settings useless. If it’s Flash, iunno.

tlm's avatar

It’s neither, it’s plain HTML. It’s just the fact that the page layout makes it VERY uncomfortable to copy things out, @the100thmonkey.

funkdaddy's avatar

Save the HTML file locally. Most likely the page uses a CSS file for its layout and other customization of the appearance.

So navigate to your page and then go to the File menu; there’s usually a save option there. In that same menu there should be an “Open File” option (or something similar) that you can use to open the file you just saved.

Once it’s separated from the CSS file, it should render in a very basic manner, which might be more readable after translation. If you keep it as an HTML file, you can still open it in your browser and use the tools there.
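
If the saved copy still pulls in its styling, the same idea can be roughly sketched in Python by removing the style information from the saved file. This is only a crude illustration using regular expressions, not a proper HTML parser; the function name `strip_styles` is made up for the example, and odd markup may slip through.

```python
import re

# Crude patterns: good enough for a one-off cleanup, not a general HTML parser.
STYLE_BLOCK = re.compile(r"<style\b[^>]*>.*?</style\s*>", re.IGNORECASE | re.DOTALL)
STYLESHEET_LINK = re.compile(
    r"<link\b[^>]*\brel\s*=\s*[\"']?stylesheet[\"']?[^>]*>", re.IGNORECASE
)
INLINE_STYLE = re.compile(r"\sstyle\s*=\s*(\"[^\"]*\"|'[^']*')", re.IGNORECASE)


def strip_styles(html):
    """Return html without <style> blocks, stylesheet links, or style attributes."""
    html = STYLE_BLOCK.sub("", html)       # embedded <style>...</style> blocks
    html = STYLESHEET_LINK.sub("", html)   # <link rel="stylesheet" ...> tags
    return INLINE_STYLE.sub("", html)      # style="..." attributes on elements
```

Read the saved file into a string, pass it through `strip_styles`, and write the result to a new `.html` file that you can open in the browser.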

As an alternative, you can view the source of the page and then remove the HTML markup to leave you with just the text. How you view the source depends on the browser you’re using; usually you can right-click in a blank area of the page and pick the option there, or find it in the menus. A tool like this will make it easier.
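
For anyone comfortable running a script, removing the markup can also be sketched with Python's standard `html.parser` module. This is a minimal sketch, not a polished tool: the names `TextExtractor` and `extract_text` are invented for the example, and it assumes you have already saved the page source and read it into a string.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text content, skipping what is inside <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # > 0 while inside a skipped tag
        self._chunks = []      # non-empty pieces of text, in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self._chunks.append(text)

    def get_text(self):
        return "\n".join(self._chunks)


def extract_text(html):
    """Return the text of an HTML string, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.get_text()
```

Each run of text between tags ends up on its own line, which is usually easy to paste into a translation tool. The output keeps no structure such as headings or tables, so it is best treated as raw material.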

dreamwolf's avatar

Right Click. Inspect Element, find the jackpot link.

tlm's avatar

@dreamwolf Just how did you know I use Opera? xD

LostInParadise's avatar

I don’t know why I did not think of this before. The Calibre document manager allows you to convert between formats, including HTML.
