Clean up pasted text from PDFs and other sources – Removing Hard Line Breaks

Removing Carriage Returns and Line Breaks

A common problem when gathering information from PDFs and other sources is that when we paste the copied text into our apps and programs, the lines don’t run to the end of the editor width but are cut short. This happens because we unknowingly copied the hard line breaks along with the text.

  • One possible fix is to paste differently, e.g. with or without formatting, or as plain text, but that is not always feasible.

Although there might be community plugins out there that can help with cleaning up our text, it’s better to take things into our own hands for more than one reason:

  • We can cut down on clutter caused by too many community plugins installed (“which was that thingy that did this…?”);
  • We can further customize clean-up jobs based on our use cases, languages, etc.
  • We can learn a new trick and take our own initiative next time around.

We are going to employ regular expressions for our text clean-ups.

Templater method

Install the Templater community plugin and set the folder where you keep your templates in its settings. Basic stuff. Then create a new note in your Templates folder, e.g. “Clean up Template”.
Copy and paste this into it:

<%*
const editor = app.workspace.activeLeaf.view.editor;
// Put your rules here
function applyRules(text) {
    const rules = [
        {
            from: /\s+$/gm,
            to: "",
        },
        {
            from: /(\r\n)+|\r+|\n+/gm,
            to: " ",
        }
    ];
    for (const rule of rules) {
        text = text.replace(rule.from, rule.to);
    }
    return text;
}
// The text selected before running the template
const selText = editor.getSelection() || '';
// Effecting changes
const modifiedText = applyRules(selText);
editor.replaceSelection(`${modifiedText}`);
%>

In the Templater settings, assign a hotkey to this template. You’ll be taken to Obsidian’s Hotkeys settings to set the key combination. The template will now also be available as a command, should you want to put it on the Command Palette, Editing Toolbar, Mobile Toolbar, etc.

A variation also removes end-of-line hyphenation, which is useful if your language has many long words that often get split across lines:

<%*
const editor = app.workspace.activeLeaf.view.editor;
// Put your rules here
function applyRules(text) {
    const rules = [
        {
            from: /\s+$/gm,
            to: "",
        },
        {
            from: /(\r\n)+|\r+|\n+/gm,
            to: " ",
        },
        {
            from: /([a-zžáàäæãééíóöőüűčñßðđŋħjĸłß])(-\s{1})([a-zžáàäæãééíóöőüűčñßðđŋħjĸłß])/gm,
            to: "$1$3",
        }
    ];
    for (const rule of rules) {
        text = text.replace(rule.from, rule.to);
    }
    return text;
}
// The text selected before running the template
const selText = editor.getSelection() || '';
// Effecting changes
const modifiedText = applyRules(selText);
editor.replaceSelection(`${modifiedText}`);
%>
  • This will also join compound words that should keep their hyphen whenever that hyphen falls at the end of a line, so you need to proofread your text for any mistakes. Or just use the first template without the extra rule.
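
To see that caveat in action, here is a minimal stand-alone sketch that runs in Node or a browser console, outside Obsidian. It uses the same three rules as the template; the sample sentence is made up for illustration.

```javascript
// Same three rules as in the template above.
const rules = [
    { from: /\s+$/gm, to: "" },
    { from: /(\r\n)+|\r+|\n+/gm, to: " " },
    {
        from: /([a-zžáàäæãééíóöőüűčñßðđŋħjĸłß])(-\s{1})([a-zžáàäæãééíóöőüűčñßðđŋħjĸłß])/gm,
        to: "$1$3",
    },
];

function applyRules(text) {
    for (const rule of rules) {
        text = text.replace(rule.from, rule.to);
    }
    return text;
}

// "infor-mation" is ordinary hyphenation at a line end;
// "well-known" is a real compound whose hyphen happened to land at a line end.
const pasted = "infor-\nmation about a well-\nknown author";
const result = applyRules(pasted);

console.log(result);
// information about a wellknown author
```

The rule cannot tell the two cases apart, which is why proofreading is still needed afterwards.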

You can easily add more rules over time, if you know some regex, find a pattern online, or get one from a chatbot.
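
As a rough illustration of adding a rule, the sketch below appends one extra entry to the rules array. The extra rule (collapsing runs of spaces) is my own example and not part of the templates above. It runs as plain JavaScript in Node or a browser console, so you can try a rule before putting it into the template.

```javascript
const rules = [
    { from: /\s+$/gm, to: "" },              // rule 1: trailing whitespace
    { from: /(\r\n)+|\r+|\n+/gm, to: " " },  // rule 2: line breaks become one space
    // Hypothetical extra rule, not in the original templates:
    // collapse runs of two or more spaces into a single space.
    { from: / {2,}/g, to: " " },
];

function applyRules(text) {
    for (const rule of rules) {
        text = text.replace(rule.from, rule.to);
    }
    return text;
}

const sample = "PDF text   with  uneven\nspacing and a hard\nline break.";
const cleaned = applyRules(sample);

console.log(cleaned);
// PDF text with uneven spacing and a hard line break.
```

Rules run in the order they are listed, so a rule that depends on line breaks already being replaced has to come after rule 2.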

What the script does

Apart from what the comments in the script say, this is what it does:

  • Gets the editor of the active leaf, i.e. the note you are currently working in.
  • In a function, applies the replacement rules one after another in a loop.
    • Look how the from (“match”) and to (“replace”) parts of each rule follow one another. You can add more.
  • Using two variables, it takes the selected text, modifies it, and replaces the selection in the editor.
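
If you want to follow that flow outside Obsidian, here is a small sketch. Note that makeFakeEditor is a made-up stand-in that only imitates the two editor calls the template uses (getSelection and replaceSelection); it is not Obsidian's real editor object.

```javascript
// Hypothetical stand-in for the Obsidian editor, for illustration only.
// It holds some text plus a fixed selection range.
function makeFakeEditor(content, selStart, selEnd) {
    return {
        content,
        getSelection() {
            return this.content.slice(selStart, selEnd);
        },
        replaceSelection(replacement) {
            this.content =
                this.content.slice(0, selStart) + replacement + this.content.slice(selEnd);
        },
    };
}

// Rules 1 and 2 from the template.
function applyRules(text) {
    const rules = [
        { from: /\s+$/gm, to: "" },
        { from: /(\r\n)+|\r+|\n+/gm, to: " " },
    ];
    for (const rule of rules) {
        text = text.replace(rule.from, rule.to);
    }
    return text;
}

const note = "Keep this line.\nbroken\nline\nKeep this too.";
const start = note.indexOf("broken");
const end = start + "broken\nline".length;
const editor = makeFakeEditor(note, start, end);

// The same three steps as at the bottom of the template.
const selText = editor.getSelection() || "";
const modifiedText = applyRules(selText);
editor.replaceSelection(modifiedText);

console.log(JSON.stringify(editor.content));
// "Keep this line.\nbroken line\nKeep this too."
```

Only the selected part changes; the lines before and after the selection keep their line breaks.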

Regex rules and what they do

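Rule 1: Deletes any trailing whitespace from the end of each line.
Rule 2: Replaces carriage returns/hard line breaks with a space character, effectively making your text flow continuously.
Rule 3: If there is a hyphen followed by a whitespace character between two lowercase letters, it deletes the hyphen and the whitespace. Because it runs after Rule 2, this catches words that were hyphenated across a line break.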
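
A quick way to check what each rule contributes is to apply them one at a time and look at the intermediate text. This is a minimal sketch in plain JavaScript (Node or a browser console); the sample text is made up, and JSON.stringify is only used so that line breaks show up as \n in the output.

```javascript
const rule1 = /\s+$/gm;
const rule2 = /(\r\n)+|\r+|\n+/gm;
const rule3 = /([a-zžáàäæãééíóöőüűčñßðđŋħjĸłß])(-\s{1})([a-zžáàäæãééíóöőüűčñßðđŋħjĸłß])/gm;

// Trailing spaces on line 1, a hyphenated word across a Windows line break,
// and a line break at the very end.
const input = "The quick brown   \nfox jumps over the la-\r\nzy dog.\n";

const afterRule1 = input.replace(rule1, "");
const afterRule2 = afterRule1.replace(rule2, " ");
const afterRule3 = afterRule2.replace(rule3, "$1$3");

console.log(JSON.stringify(afterRule1));
// "The quick brown\nfox jumps over the la-\nzy dog."
console.log(JSON.stringify(afterRule2));
// "The quick brown fox jumps over the la- zy dog."
console.log(JSON.stringify(afterRule3));
// "The quick brown fox jumps over the lazy dog."
```

Notice that Rule 1 also strips the stray \r and the final line break, since both count as whitespace at the end of a line.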

Apply Patterns plugin method

Again, install the plugin. In its settings, add a pattern and name it like before: “Clean up text” or something. Add the rules – the same ones as above – one by one. Don’t forget to turn on the global and multiline switches on each of them.

Rule one:

Matching text: \s+$
Replacement: `` (nothing; leave box empty)

Rule two:

Matching text: (\r\n)+|\r+|\n+
Replacement: (one space)

Rule three:

(if you need it)

Matching text: ([a-zžáàäæãééíóöőüűčñßðđŋħjĸłß])(-\s{1})([a-zžáàäæãééíóöőüűčñßðđŋħjĸłß])
Replacement: $1$3
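
Why the global and multiline switches matter is easiest to see with rule one. The sketch below is plain JavaScript and assumes the plugin's switches correspond to the JavaScript g and m regex flags; I have not checked the plugin's source, so take that as an assumption.

```javascript
// Three lines, each with two trailing spaces.
const text = "one  \ntwo  \nthree  ";

// No flags: `$` only matches at the very end of the text,
// so only the last line is trimmed.
const noFlags = text.replace(/\s+$/, "");

// Multiline only: `$` matches at every line end,
// but without global only the first match is replaced.
const multilineOnly = text.replace(/\s+$/m, "");

// Global + multiline: every line is trimmed.
const globalMultiline = text.replace(/\s+$/gm, "");

console.log(JSON.stringify(noFlags));
// "one  \ntwo  \nthree"
console.log(JSON.stringify(multilineOnly));
// "one\ntwo  \nthree  "
console.log(JSON.stringify(globalMultiline));
// "one\ntwo\nthree"
```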

Scroll down to the Commands section and give the command the same name. In the Pattern Name Filter, enter the name you set above in the Patterns section: “Clean up text” or whatever.
Here you have different command options: you can perform regex replacements on the selected text (like above), or on the whole file you have open. It’s better to stick with the selection here as well: tick Apply to Selection.

Now disable and re-enable the Apply Patterns plugin, and the command will be available on the Command Palette. You can bind a hotkey to it.

How to use

Select only the text you want to perform cleanup operations on, and all of it. Run the Templater template or the Apply Patterns command.
If you accidentally joined text you did not want to, hit Ctrl+Z and select only the text that needs the clean-up treatment.

You can see why selecting the text manually is a better idea:
[Image: ShareX screen capture]
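
To make the point about selecting manually concrete: the rules do not know where one paragraph ends and the next begins, so running them over several paragraphs at once merges everything. The sketch below shows that, plus one possible workaround that is not part of the templates above: a hypothetical helper, joinLinesPerParagraph, which assumes paragraphs are separated by at least one blank line.

```javascript
// Rules 1 and 2 from the template, chained.
function joinLines(text) {
    return text
        .replace(/\s+$/gm, "")
        .replace(/(\r\n)+|\r+|\n+/gm, " ");
}

// Hypothetical helper (an assumption, not from the original post):
// split on blank lines, clean each paragraph on its own, then rejoin.
function joinLinesPerParagraph(text) {
    return text
        .split(/(?:\r?\n[ \t]*){2,}/)
        .map((paragraph) => joinLines(paragraph))
        .join("\n\n");
}

const twoParagraphs =
    "First paragraph,\nbroken in two.\n\nSecond paragraph,\nalso broken.";

const merged = joinLines(twoParagraphs);
const preserved = joinLinesPerParagraph(twoParagraphs);

console.log(JSON.stringify(merged));
// "First paragraph, broken in two. Second paragraph, also broken."
console.log(JSON.stringify(preserved));
// "First paragraph, broken in two.\n\nSecond paragraph, also broken."
```

This only helps when the pasted text really has blank lines between paragraphs; many PDFs don't, in which case selecting paragraph by paragraph is still the safer route.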


Script is inspired by AlanG’s post.

I use the Text Format plugin for this. It has the useful command called “Merge broken paragraphs in selection”

My post was not necessarily written to provide the single solution to a problem. It was written to give people an example of how to build anything to do with text manipulation. Granted, if people don’t know any regex or find it scary, they will not find it too useful.

I wrote it because I myself would have liked something like this written up when I knew next to nothing about Obsidian and what awesome customizations and automations one can do in it.

I have the Linter plugin installed, but it seems it can’t do that?

I never used the paste capabilities of that plugin, if that’s what you’re pointing at. My methods don’t involve Linter.

Yes, I understand. I just thought that maybe there would be a way to use Linter for this somehow, since it is installed and in use anyway. I will ask them directly :)

Hi all,

I just came across this program that might make things simpler:

This program automatically replaces the line breaks with spaces, so no more need to be manually backspacing. The program runs in the background, you just have to copy paste.